Joint Learning of Author and Document Representations
Abstract
Most recent language models use contextualized word embeddings, learned with the Transformer
architecture, and achieve state-of-the-art performance on many natural language
processing tasks. Pretrained versions of these models are now widely used; however, their
fine-tuning on specific tasks remains a central question. For example, these methods do not
provide document- or author-level representations: a simple average of the contextualized
word embeddings is not good enough (Reimers and Gurevych, 2019). We develop a simple architecture
based on Variational Information Bottleneck (VIB) to learn author and document
representations using pre-trained contextualized word vectors (Devlin et al., 2019). We evaluate
our method quantitatively and qualitatively on two datasets: a news article corpus, and
a scientific article corpus. Our method produces more robust representations than existing
methods and performs well in author identification and classification.
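To make the idea concrete, the following is a minimal sketch (not the authors' code) of a VIB-style encoder that pools pre-trained contextualized word vectors into a stochastic document or author embedding; the layer sizes, pooling scheme, and standard-normal prior are illustrative assumptions.

```python
# Hypothetical sketch of a Variational Information Bottleneck encoder over
# contextualized word vectors (e.g. BERT outputs); dimensions are assumptions.
import torch
import torch.nn as nn

class VIBEncoder(nn.Module):
    def __init__(self, word_dim=768, latent_dim=128):
        super().__init__()
        self.mu = nn.Linear(word_dim, latent_dim)      # mean of q(z|x)
        self.logvar = nn.Linear(word_dim, latent_dim)  # log-variance of q(z|x)

    def forward(self, word_vectors, mask):
        # word_vectors: (batch, seq_len, word_dim) contextualized embeddings
        # mask: (batch, seq_len), 1 for real tokens, 0 for padding
        pooled = (word_vectors * mask.unsqueeze(-1)).sum(1) / mask.sum(1, keepdim=True)
        mu, logvar = self.mu(pooled), self.logvar(pooled)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        # KL term of the bottleneck against a standard normal prior
        kl = 0.5 * (logvar.exp() + mu.pow(2) - 1.0 - logvar).sum(-1).mean()
        return z, kl
```

In such a setup, `z` would feed a downstream objective (e.g. author identification or classification) while the KL term regularizes how much information the representation retains.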