Deep Dirichlet Processes for Topic Modeling
Abstract
This paper presents two novel models: the neural Embedded Dirichlet Process and the
neural Embedded Hierarchical Dirichlet Process. Both methods extend the Embedded Topic
Model (ETM) to nonparametric settings, simultaneously learning the number of topics,
latent document representations, and topic and word embeddings from the data. To achieve
this, we replace ETM's logistic-normal prior with Dirichlet Processes in a variational
autoencoding inference setting. Experiments on the 20 Newsgroups and Humanitarian
Assistance and Disaster Relief datasets show that our models maintain low perplexity
while providing meaningful representations that outperform those of state-of-the-art
methods. We obtain these results without costly reruns to find the number of topics and
without sacrificing a Dirichlet-like prior.
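To make the prior concrete, the following is a minimal sketch of the truncated stick-breaking construction that underlies a Dirichlet Process prior over topic proportions. It is not the authors' inference procedure (which is variational autoencoding); the function and parameter names are our own. Sticks that receive negligible mass effectively switch topics off, which is how such a prior lets the number of topics be inferred from data.

```python
import numpy as np

def stick_breaking(alpha: float, truncation: int, rng: np.random.Generator) -> np.ndarray:
    """Sample topic proportions from a truncated Dirichlet Process via stick-breaking.

    alpha      -- DP concentration; larger values spread mass over more topics.
    truncation -- maximum number of topics K (hypothetical parameter name).
    """
    # v_k ~ Beta(1, alpha) for k = 1..K
    betas = rng.beta(1.0, alpha, size=truncation)
    # remaining[k] = prod_{j < k} (1 - v_j): the stick length left before break k
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas)[:-1]])
    # pi_k = v_k * prod_{j < k} (1 - v_j)
    pis = betas * remaining
    # truncate: absorb the leftover mass into the last stick so pis sums to 1
    pis[-1] = 1.0 - pis[:-1].sum()
    return pis

rng = np.random.default_rng(0)
theta = stick_breaking(alpha=1.0, truncation=50, rng=rng)  # document-topic proportions
print(theta[:5], theta.sum())  # most mass concentrates on the first few topics
```

With a small concentration alpha, most samples place nearly all their mass on a handful of topics, so the effective number of topics is learned rather than fixed in advance.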