Usefulness of coupling Word2Vec with latent semantic analysis: an experiment in text data categorization.

Oussama Ahmia, Nicolas Béchet, Pierre-François Marteau, Alexandre Garel

In EGC 2019, vol. RNTI-E-35, pp. 129-140

Abstract

We present a study of text vectorization methods for document classification. We examine methods based on word embeddings (word2vec) and document embeddings (latent semantic analysis, and bag of words with various weightings), as well as some combinations of these methods. We evaluate these vectorization approaches with three classification models: a multilayer perceptron, a linear support vector machine trained by stochastic gradient descent, and multinomial or Gaussian naïve Bayes classifiers. Our results clearly show that the straightforward combination of word2vec and LSA that we propose, which brings together two complementary definitions of the context of word occurrences (local for word2vec, global for LSA), produces a robust text vectorization that is, in general, significantly more discriminating than the other vectorization approaches tested.
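
As an illustration of what combining a local and a global representation can look like, here is a minimal sketch. The abstract does not specify how the two representations are combined, so everything below is an assumption: the documents are toy data, random vectors stand in for a trained word2vec model, the word vectors of each document are simply averaged, LSA is a truncated SVD of raw term counts, and the two parts are L2-normalized and concatenated. The function names are hypothetical.

```python
import numpy as np


def tokenize(text):
    # Naive whitespace tokenizer, sufficient for the toy corpus below.
    return text.lower().split()


def lsa_vectors(docs, k):
    # Global context: truncated SVD of the document-term count matrix.
    vocab = sorted({w for d in docs for w in tokenize(d)})
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(docs), len(vocab)))
    for row, d in enumerate(docs):
        for w in tokenize(d):
            counts[row, index[w]] += 1.0
    u, s, _ = np.linalg.svd(counts, full_matrices=False)
    return u[:, :k] * s[:k]


def mean_word_vectors(docs, word_vectors, dim):
    # Local context: average the embeddings of the words in each document.
    out = np.zeros((len(docs), dim))
    for row, d in enumerate(docs):
        known = [word_vectors[w] for w in tokenize(d) if w in word_vectors]
        if known:
            out[row] = np.mean(known, axis=0)
    return out


def l2_normalize(m):
    # Scale each row to unit length; all-zero rows are left unchanged.
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    return m / np.where(norms == 0, 1.0, norms)


def combined_vectors(docs, word_vectors, dim, k):
    # Assumed combination: concatenate the two normalized representations.
    local_part = l2_normalize(mean_word_vectors(docs, word_vectors, dim))
    global_part = l2_normalize(lsa_vectors(docs, k))
    return np.hstack([local_part, global_part])


docs = [
    "the bank approved the loan",
    "the river bank was flooded",
    "the loan rate increased",
    "the flooded river receded",
]

# Random stand-in for a trained word2vec model (assumption, toy data only).
rng = np.random.default_rng(0)
vocabulary = sorted({w for d in docs for w in tokenize(d)})
toy_word_vectors = {w: rng.normal(size=4) for w in vocabulary}

features = combined_vectors(docs, toy_word_vectors, dim=4, k=2)
print(features.shape)
```

The resulting matrix has one row per document and can be passed to any of the classifiers mentioned in the abstract. Normalizing each part before concatenation is one way to keep either representation from dominating by scale alone; it is a design choice of this sketch and not something stated in the abstract.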
