Corpus processing, statistics, and title detection improvement
This branch was originally intended to improve corpus processing, with the following changes:
- Adding a Corpus class and methods to retrieve useful stats.
- Integrating that class in corpus_processing.py
- Changing document.py and collection.py to retrieve info for statistics (mainly metadata)
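
For readers unfamiliar with the idea, here is a minimal sketch of what a corpus class exposing statistics could look like. It is not the code from this branch: the `Document` stand-in, the method names (`document_count`, `average_word_count`, `metadata_counts`) and the `language` metadata key are all assumptions made up for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Document:
    """Hypothetical stand-in for the project's document type."""
    name: str
    text: str
    metadata: dict = field(default_factory=dict)


class Corpus:
    """Hypothetical sketch: holds documents and derives simple statistics."""

    def __init__(self, documents):
        self.documents = list(documents)

    def document_count(self):
        # Number of documents in the corpus.
        return len(self.documents)

    def average_word_count(self):
        # Whitespace tokenization, chosen only to keep the example short.
        if not self.documents:
            return 0.0
        return mean(len(doc.text.split()) for doc in self.documents)

    def metadata_counts(self, key):
        # Count each value of a metadata key; documents lacking it count as "unknown".
        return Counter(doc.metadata.get(key, "unknown") for doc in self.documents)


# Made-up sample data, only to show how the methods would be called.
corpus = Corpus([
    Document("a.pdf", "one two three", {"language": "en"}),
    Document("b.pdf", "four five", {"language": "fr"}),
    Document("c.pdf", "six", {}),
])
print(corpus.document_count())
print(corpus.average_word_count())
print(corpus.metadata_counts("language"))
```

Keeping the statistics on the corpus object, with metadata read from each document, mirrors the split described above: the document and collection code supplies the metadata, and the corpus class aggregates it.
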
This branch is up to date with synchro_eric, which also brings numerous changes, such as:
- better logging mechanisms
- better title detection
- better table detection
- better text retrieval for logical sections
- etc.