Lexical representations for unsupervised event detection in a stream of tweets: a study on French and English corpora
Abstract
In this work, we evaluate the performance of recent text embeddings for the automatic
detection of events in a stream of tweets. We model this task as a dynamic clustering problem.
Our experiments are conducted on a publicly available corpus of tweets in English and on a
similar dataset in French annotated by our team. We show that recent techniques based on deep
neural networks (ELMo, Universal Sentence Encoder, BERT, SBERT), although promising in
many applications, are poorly suited to this task. We also experiment with different types
of fine-tuning to improve these results on French data. Finally, we provide a detailed analysis
of the results, showing that tf-idf approaches perform best on this task.
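
To make the "dynamic clustering" framing concrete, the following is a minimal sketch of one common way to cluster a stream of short texts over tf-idf vectors: each incoming tweet joins the cluster of its most similar earlier tweet if the cosine similarity reaches a threshold, and otherwise opens a new cluster. This is an illustrative assumption, not necessarily the algorithm evaluated in this work. The function names (`tfidf_vectors`, `cluster_stream`), the whitespace tokenizer, the smoothed idf formula, and the threshold of 0.3 are all hypothetical choices. For brevity, idf is computed once over the whole toy collection, whereas a real streaming setting would have to estimate it from past data only.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer (an assumption made for this sketch).
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one L2-normalised sparse tf-idf vector (a dict) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Smoothed idf: log((1 + n) / (1 + df)) + 1.
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0)
               for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


def cosine(a, b):
    # Both vectors are already L2-normalised, so the dot product suffices.
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def cluster_stream(vectors, threshold=0.3):
    """Label vectors in arrival order using a nearest-neighbour threshold rule."""
    labels = []
    next_label = 0
    for i, v in enumerate(vectors):
        best_sim, best_j = 0.0, None
        # Compare with every earlier vector (a real system would use a window).
        for j in range(i):
            s = cosine(v, vectors[j])
            if s > best_sim:
                best_sim, best_j = s, j
        if best_j is not None and best_sim >= threshold:
            labels.append(labels[best_j])   # join the nearest tweet's cluster
        else:
            labels.append(next_label)       # open a new cluster (new event)
            next_label += 1
    return labels


tweets = [
    "earthquake hits city centre",
    "strong earthquake hits the city",
    "football final tonight",
    "who wins the football final tonight",
]
labels = cluster_stream(tfidf_vectors(tweets), threshold=0.3)
print(labels)  # [0, 0, 1, 1]
```

In this toy example the two earthquake tweets share one label and the two football tweets share another. The threshold controls the trade-off between splitting one event into several clusters and merging unrelated events, so it would need tuning on annotated data.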