Usefulness of Coupling Word2Vec with Latent Semantic Analysis: An Experiment in Text Data Categorization.
Abstract
In this article, we present a study of text vectorization methods for document classification.
We study methods based on word embedding (word2vec) and on document embedding (latent
semantic analysis, or LSA, and bag-of-words representations with various weighting schemes),
as well as some combinations of these methods. To this end, we evaluate these vectorization
approaches using three classification models (a multilayer perceptron, a linear support vector
machine trained with stochastic gradient descent, and multinomial or Gaussian naïve Bayes classifiers).
Our results clearly show that the straightforward combination of word2vec and LSA that we
propose, which brings together two complementary definitions of the context of word
occurrences (local for word2vec and global for LSA), produces a robust text vectorization
that is, in general, significantly more discriminating than the other vectorization
approaches tested.
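
To make the idea of combining a local and a global view of context concrete, the following is a minimal sketch, not the implementation evaluated in this article. It assumes that the combination is a plain concatenation of an averaged word-vector representation with an LSA document vector, which the abstract does not specify. The word vectors are random placeholders standing in for a trained word2vec model, and the helper names (`lsa_vectors`, `mean_word_vectors`, `combine`), the toy corpus, and the dimensions are all hypothetical.

```python
import numpy as np


def lsa_vectors(docs, vocab, k):
    """Global context: rank-k SVD projection of a TF-IDF document-term matrix."""
    index = {w: j for j, w in enumerate(vocab)}
    counts = np.zeros((len(docs), len(vocab)))
    for i, doc in enumerate(docs):
        for w in doc:
            counts[i, index[w]] += 1
    df = (counts > 0).sum(axis=0)            # document frequency of each term
    idf = np.log(len(docs) / df) + 1.0       # one common IDF variant (an assumption)
    u, s, _ = np.linalg.svd(counts * idf, full_matrices=False)
    return u[:, :k] * s[:k]                  # document coordinates in the latent space


def mean_word_vectors(docs, word_vectors):
    """Local context: average of the word vectors of each document."""
    return np.vstack([np.mean([word_vectors[w] for w in doc], axis=0) for doc in docs])


def l2_normalize(x):
    """Scale each row to unit length so that neither block dominates the other."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.where(norms == 0, 1.0, norms)


def combine(docs, vocab, word_vectors, k):
    """Assumed combination: concatenate the normalized local and global vectors."""
    local_part = l2_normalize(mean_word_vectors(docs, word_vectors))
    global_part = l2_normalize(lsa_vectors(docs, vocab, k))
    return np.hstack([local_part, global_part])


# Toy tokenized corpus (hypothetical data, for illustration only).
docs = [
    ["text", "classification", "model"],
    ["text", "vector", "model"],
    ["semantic", "vector", "analysis"],
    ["semantic", "text", "analysis"],
]
vocab = sorted({w for doc in docs for w in doc})

# Placeholder embeddings: random vectors standing in for trained word2vec vectors.
rng = np.random.default_rng(0)
word_vectors = {w: rng.normal(size=8) for w in vocab}

features = combine(docs, vocab, word_vectors, k=2)
print(features.shape)
```

The resulting matrix has one row per document and can be passed to any of the classifiers mentioned above. Normalizing each block before concatenation is one possible design choice among several; weighting the two blocks differently would be an equally plausible variant.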