Automatic Classification of Encyclopaedic Articles
Abstract
This article presents a comparative study of different supervised classification approaches
applied to the automatic classification of encyclopaedic articles. Our training corpus is composed
of the 17 volumes of text of the Encyclopédie by Diderot and d'Alembert (1751-1772)
representing a total of about 70,000 articles. We experimented with different approaches to
text vectorization (bag of words and word embeddings) combined with classical machine learning
methods, deep learning and BERT architectures. In addition to the comparison of these
different approaches, our objective is to automatically identify the domains of the unclassified
articles of the Encyclopédie (about 2,400 articles). The best model achieves an average
F1-score of 83% over the 38 classes. Moreover, our study highlights the difficulty of classifying certain
semantically close classes. All the code developed and the results obtained as part
of this project are available as open source.
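
To make the reported metric concrete, the following is a minimal sketch of how an average F1-score over a set of classes can be computed. It assumes the average is the unweighted (macro) mean of per-class F1-scores, which the abstract does not specify. The function name macro_f1 and the toy labels are hypothetical illustrations and are not taken from the project's code or corpus.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in y_true or y_pred.

    Assumption: "average F1" means the macro average; a weighted or micro
    average would give a different number on imbalanced classes.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(y_true, y_pred):
        if gold == pred:
            tp[gold] += 1      # correct prediction for this class
        else:
            fp[pred] += 1      # predicted class was wrongly assigned
            fn[gold] += 1      # gold class was missed

    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        # F1 = 2*TP / (2*TP + FP + FN), defined as 0 when the denominator is 0
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Toy example with made-up domain labels (hypothetical, not from the corpus).
gold = ["geography", "geography", "medicine", "law"]
pred = ["geography", "medicine", "medicine", "law"]
score = macro_f1(gold, pred)
print(f"macro F1 = {score:.3f}")
```

With the macro average, every class counts equally regardless of its size, so small classes that are often confused with semantically close neighbours pull the score down as much as large ones.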