RNTI

Classification automatique d'articles encyclopédiques (Automatic Classification of Encyclopedic Articles)
In EGC 2022, vol. RNTI-E-38, pp. 63-74
Abstract
This article presents a comparative study of supervised approaches to the automatic classification of encyclopedic articles. Our training corpus consists of the 17 text volumes of Diderot and d'Alembert's Encyclopédie (1751-1772), about 70,000 articles in total. We experimented with different text vectorization approaches (bag of words and word embeddings) combined with classical machine learning methods, deep learning, and BERT architectures. Beyond comparing these approaches, our objective is to automatically identify the domains of the roughly 2,400 unclassified articles of the Encyclopédie. The best model achieves an average F1-score of 83% over the 38 classes. Our study also highlights the difficulty of distinguishing certain semantically close classes. All code developed and results obtained in this project are available as open source.
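To make two notions from the abstract concrete, the following minimal Python sketch shows a bag-of-words representation and a class-averaged F1-score. It is an illustration only, not the authors' code: the function names, the whitespace tokenizer, and the toy labels are hypothetical, and it assumes the reported average is an unweighted (macro) mean over classes, which the abstract does not specify.

```python
from collections import Counter


def bag_of_words(text):
    """Map a text to word counts (lowercased, whitespace-tokenized)."""
    return Counter(text.lower().split())


def f1_per_class(y_true, y_pred):
    """Return {label: F1} for every label seen in y_true or y_pred."""
    scores = {}
    pairs = list(zip(y_true, y_pred))
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (assumed averaging scheme)."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Hypothetical toy labels, not taken from the Encyclopédie corpus.
y_true = ["geography", "geography", "medicine", "law"]
y_pred = ["geography", "medicine", "medicine", "law"]

if __name__ == "__main__":
    print(bag_of_words("La mer la plage"))
    print(f1_per_class(y_true, y_pred))
    print(macro_f1(y_true, y_pred))
```

With a macro average, every class counts equally regardless of its size, so a small class that is often confused with a semantically close neighbour lowers the score as much as a large one would.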