Joint Learning of Author and Document Representations
In EGC 2021, vol. RNTI-E-37, pp. 11-23
Abstract
Most recent language models rely on contextualized word embeddings learned with the Transformer architecture, and they achieve state-of-the-art performance on many natural language processing tasks. Pretrained versions of these models are now widely used; however, fine-tuning them for specific tasks remains a central question. In particular, these methods do not provide document- and author-level representations: a simple average of the contextualized word embeddings is not good enough (Reimers and Gurevych, 2019). We develop a simple architecture based on the Variational Information Bottleneck (VIB) to learn author and document representations from pre-trained contextualized word vectors (Devlin et al., 2019). We evaluate our method quantitatively and qualitatively on two datasets: a news article corpus and a scientific article corpus. Our method produces more robust representations than existing methods and performs well on author identification and classification.
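To make the general idea concrete, the sketch below shows a minimal VIB-style head placed on top of pooled pre-trained contextualized embeddings. This is an illustrative assumption of how such an architecture can be wired, not the authors' implementation: the dimensions, the author-classification objective, the `VIBHead` class, and the `beta` weight are all hypothetical choices.

```python
# Minimal sketch (not the paper's code): a Variational Information Bottleneck head
# over pooled contextualized embeddings (e.g., mean-pooled BERT vectors).
import torch
import torch.nn as nn
import torch.nn.functional as F

class VIBHead(nn.Module):
    def __init__(self, input_dim=768, latent_dim=128, num_authors=50):
        super().__init__()
        # Encoder maps a document embedding to the parameters of a Gaussian posterior.
        self.mu = nn.Linear(input_dim, latent_dim)
        self.logvar = nn.Linear(input_dim, latent_dim)
        # The stochastic latent code is decoded into author logits.
        self.classifier = nn.Linear(latent_dim, num_authors)

    def forward(self, doc_embedding):
        mu = self.mu(doc_embedding)
        logvar = self.logvar(doc_embedding)
        # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I).
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        logits = self.classifier(z)
        # KL divergence between the posterior N(mu, sigma^2) and the prior N(0, I).
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1).mean()
        return logits, kl

# Toy usage: random vectors stand in for pooled BERT document embeddings.
docs = torch.randn(16, 768)            # batch of 16 "documents"
authors = torch.randint(0, 50, (16,))  # illustrative author labels
model = VIBHead()
logits, kl = model(docs)
beta = 1e-3                            # bottleneck strength (illustrative value)
loss = F.cross_entropy(logits, authors) + beta * kl
loss.backward()
```

The KL term compresses the latent code toward the prior while the classification term keeps it predictive of the author, which is the generic trade-off a VIB objective encodes.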