The Powered Dirichlet-Hawkes Process as a Flexible Prior for Temporal Text Clustering
Abstract
The textual content of a document and its publication date are intertwined. For example,
the publication of a news article on a topic is influenced by previous publications on similar
issues, according to underlying temporal dynamics. However, it can be challenging to retrieve
meaningful information when the text itself conveys little. Furthermore, the textual content
of a document is not always correlated with its temporal dynamics. We develop a method
to create clusters of textual documents according to both their content and publication time,
the Powered Dirichlet-Hawkes process (PDHP). PDHP yields significantly better results than
state-of-the-art models when temporal information or textual content is weakly informative.
PDHP also relaxes the assumption that textual content and temporal dynamics are perfectly
correlated. We demonstrate that PDHP generalizes previous work, such as DHP and UP. Finally,
we illustrate a possible application using a real-world dataset from Reddit.