There is no point in decoding, you must search in due time: cross-domain keyphrase prediction by retrieval and ranking with fine-tuned Sentence-Transformers
In EGC 2026, vol. RNTI-E-42, pp. 133-144
Abstract
Keyphrase generation remains challenging for current state-of-the-art methods, which are still heavily centered on neural encoder-decoder architectures. These models often struggle to generalize across domains, or even to generate satisfactory absent keyphrases within their own training domain, and they require substantial computational resources for marginal performance gains. Our approach introduces an encoder-only architecture dedicated to ranking keyphrases drawn from cross-domain candidate pools. Using the same training set as the decoders, but treating it as an index, we process each test document as a query: we retrieve keyphrases from its nearest neighbors and learn to rank them with a fine-tuned Sentence-Transformer adapted to multiple domains. Training relies on a multiple negatives ranking objective, where each document is paired with its reference keyphrase among a shared pool of negative candidates. However, since certain keyphrases may be relevant to multiple documents, we mask their loss contributions to avoid penalizing the model for potentially valid associations. This adaptation better captures semantic overlaps between documents. We compare our method against strong seq2seq baselines, evaluating F1-score and recall for present and absent keyphrases, out-of-domain robustness, latency, training and inference costs, and environmental footprint. Our retrieval-and-ranking approach matches or surpasses generative baselines while significantly mitigating the limitations inherent in neural decoder architectures. This work thus positions itself as an alternative to seq2seq models for indexing documents in knowledge bases using keyphrases.
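The masked multiple negatives ranking objective described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name `masked_mnr_loss`, the batch layout (row i of `doc_emb` is paired with row i of `kp_emb`), and the `false_neg_mask` argument marking off-diagonal pairs that are actually valid are all assumptions for exposition. In practice such an objective is typically trained with `sentence_transformers.losses.MultipleNegativesRankingLoss` or a variant thereof.

```python
import numpy as np

def masked_mnr_loss(doc_emb, kp_emb, false_neg_mask, scale=20.0):
    """Multiple negatives ranking loss over a batch, with masking of
    known false negatives (keyphrases valid for more than one document).

    doc_emb, kp_emb: (B, d) arrays; row i of kp_emb is the reference
    keyphrase of document i. false_neg_mask: (B, B) boolean array where
    True marks an off-diagonal (document, keyphrase) pair that is in
    fact relevant and must not be penalized as a negative.
    """
    # L2-normalise so dot products become cosine similarities
    doc = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    kp = kp_emb / np.linalg.norm(kp_emb, axis=1, keepdims=True)
    scores = scale * doc @ kp.T                      # (B, B) similarity logits

    # drop masked off-diagonal pairs from the pool of in-batch negatives
    # (the diagonal, i.e. the positive pair, is never masked)
    mask = false_neg_mask & ~np.eye(len(doc), dtype=bool)
    scores = np.where(mask, -np.inf, scores)

    # numerically stable softmax cross-entropy; the positive class for
    # document i is its own reference keyphrase on the diagonal
    m = scores.max(axis=1, keepdims=True)
    log_denom = m.squeeze(1) + np.log(np.exp(scores - m).sum(axis=1))
    return float(np.mean(log_denom - np.diag(scores)))
```

Masking a pair removes its term from the softmax denominator, so a keyphrase that legitimately matches several documents no longer pushes their embeddings apart, which is the semantic-overlap effect the abstract refers to.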