Corpus processing, statistics, and title detection improvement

Menel Mahamdi requested to merge corpus-processing into main

This branch was originally intended to improve corpus processing, with the following changes:

  • Adding a Corpus class and methods to retrieve useful stats.
  • Integrating that class into corpus_processing.py.
  • Changing document.py and collection.py to retrieve the information needed for statistics (mainly metadata).
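
For reviewers who want a rough picture of what a statistics-oriented corpus class can look like, here is a minimal sketch. It is not the code in this branch: the `Document` stand-in, the method names (`document_count`, `mean_token_count`, `metadata_counts`), the whitespace tokenization, and the `lang` metadata key are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Document:
    """Hypothetical stand-in for the project's document type."""
    text: str
    metadata: dict = field(default_factory=dict)


class Corpus:
    """Hypothetical corpus wrapper exposing a few summary statistics."""

    def __init__(self, documents):
        self.documents = list(documents)

    def document_count(self):
        # Number of documents held by the corpus.
        return len(self.documents)

    def mean_token_count(self):
        # Average document length, using whitespace tokenization
        # (an assumption made for this sketch only).
        if not self.documents:
            return 0.0
        return mean(len(doc.text.split()) for doc in self.documents)

    def metadata_counts(self, key):
        # Count how often each value of a metadata field occurs.
        # Documents missing the field are counted under None.
        return Counter(doc.metadata.get(key) for doc in self.documents)


corpus = Corpus([
    Document("a b c", {"lang": "fr"}),
    Document("d e", {"lang": "fr"}),
    Document("f", {"lang": "en"}),
])
print(corpus.document_count())
print(corpus.mean_token_count())
print(corpus.metadata_counts("lang"))
```

Keeping the statistics as methods on one class means a script such as corpus_processing.py only needs to build the corpus once and can then query whichever figures it needs.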

This branch is up to date with synchro_eric, which also brings numerous changes, such as:

  • better logging mechanisms
  • better title detection
  • better table detection
  • better text retrieval for logical sections
  • etc...
