Corpus processing, statistics, and title detection improvement (!3) · liriae / Pdfstruct

Menel Mahamdi requested to merge corpus-processing into main Aug 21, 2023

This branch was originally intended to improve corpus processing, with the following changes:

Adding a Corpus class with methods to retrieve useful statistics.
Integrating that class into corpus_processing.py.
Changing document.py and collection.py to retrieve the information needed for the statistics (mainly metadata).
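
For readers unfamiliar with the idea, here is a minimal sketch of what a corpus-level statistics class could look like. It is not the code from this branch: the `Document` stand-in, its `n_pages` and `metadata` fields, and the `stats()` method are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Document:
    # Hypothetical stand-in for a parsed PDF; not the real pdfstruct Document.
    name: str
    n_pages: int
    metadata: dict = field(default_factory=dict)


class Corpus:
    """Hypothetical container that aggregates statistics over documents."""

    def __init__(self, documents):
        self.documents = list(documents)

    def stats(self):
        # Collect per-document page counts, then summarize them.
        pages = [d.n_pages for d in self.documents]
        return {
            "n_documents": len(pages),
            "total_pages": sum(pages),
            "mean_pages": mean(pages) if pages else 0.0,
            # Count documents that carry at least one metadata entry.
            "with_metadata": sum(1 for d in self.documents if d.metadata),
        }


# Usage with two made-up documents.
corpus = Corpus([
    Document("a.pdf", 10, {"author": "x"}),
    Document("b.pdf", 4),
])
summary = corpus.stats()
```

Keeping the aggregation in one class means a script such as corpus_processing.py only has to build the corpus and ask it for a summary, rather than recomputing counts itself.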

This branch is up to date with synchro_eric, which also includes numerous changes, such as:

better logging mechanisms
better title detection
better table detection
better text retrieval for logical sections
etc.