No use decoding, better to retrieve on time: cross-domain keyphrase prediction by retrieval and ranking with fine-tuned Sentence-Transformers
Abstract
Keyphrase generation remains a challenge for current state-of-the-art methods, which are
heavily centered on neural encoder-decoder architectures. These models often struggle
to generalize across domains, or even to generate satisfactory absent keyphrases within their
own training domain. Moreover, they require substantial computational resources for marginal
performance gains.
Our approach introduces an encoder-only architecture dedicated to ranking keyphrases
drawn from cross-domain candidate pools. Using the same training set as the decoders but
treating it as an index, each test document is processed as a query: we retrieve keyphrases from
its nearest neighbors and learn to rank them with a fine-tuned Sentence-Transformer adapted
to multiple domains.
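The retrieval step described above can be sketched as follows. This is a minimal illustration with handcrafted toy embeddings; in the actual system the vectors would come from the fine-tuned Sentence-Transformer, and the `retrieve_candidates` helper is a hypothetical name, not the authors' implementation:

```python
import numpy as np

# Toy index: one embedding per training document, plus its reference
# keyphrases. These vectors are handcrafted stand-ins; a real index
# would hold Sentence-Transformer embeddings of the training set.
index_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
])
index_keyphrases = [
    ["neural networks", "deep learning"],
    ["neural networks", "encoder"],
    ["protein folding"],
]

def retrieve_candidates(query_emb, k=2):
    """Treat a test document as a query: find its k nearest training
    documents by cosine similarity and pool their keyphrases as
    candidates for the subsequent ranking stage."""
    sims = index_embeddings @ query_emb / (
        np.linalg.norm(index_embeddings, axis=1) * np.linalg.norm(query_emb)
    )
    top = np.argsort(-sims)[:k]
    pool = []
    for i in top:
        for kp in index_keyphrases[i]:
            if kp not in pool:  # deduplicate the shared candidate pool
                pool.append(kp)
    return pool

query = np.array([0.95, 0.05, 0.0])
print(retrieve_candidates(query))
# → ['neural networks', 'deep learning', 'encoder']
```

The pooled candidates (including keyphrases absent from the query document's text) are then re-scored by the fine-tuned ranker.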
Training relies on a multiple negatives ranking objective, where each training document is
paired with its reference keyphrase, and the remaining keyphrases in the batch form a shared pool of negative candidates. However,
since certain keyphrases may be relevant to multiple documents, we mask their loss contributions
to avoid penalizing the model for potentially valid associations. This adaptation better
captures semantic overlaps between documents.
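The masked objective can be sketched as a NumPy rendition of an in-batch-negatives ranking loss (in the spirit of sentence-transformers' MultipleNegativesRankingLoss). The function name, the `valid` mask convention, and the toy inputs are illustrative assumptions, not the authors' exact formulation:

```python
import numpy as np

def masked_mnr_loss(doc_embs, kp_embs, valid, scale=20.0):
    """Multiple-negatives ranking loss with false-negative masking.

    doc_embs[i] is paired with kp_embs[i]; the other keyphrases in the
    batch act as negatives. valid[i, j] = True marks keyphrase j as also
    relevant to document i, so its logit is masked out (-inf) rather
    than penalized as a negative (illustrative convention)."""
    # cosine similarities, scaled as is common for this loss family
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    k = kp_embs / np.linalg.norm(kp_embs, axis=1, keepdims=True)
    logits = scale * (d @ k.T)
    # drop valid off-diagonal pairs from the softmax denominator
    mask = valid & ~np.eye(len(d), dtype=bool)
    logits[mask] = -np.inf
    # cross-entropy with the diagonal (the reference pair) as target
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

Masking strictly shrinks the softmax denominator for the affected rows, so a keyphrase shared by two documents no longer pushes their embeddings apart.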
We compare our method against strong seq2seq baselines, evaluating F-score and recall
for present and absent keyphrases, out-of-domain robustness, latency, training and inference
costs, as well as environmental footprint. Our retrieval-ranking approach matches or surpasses
generative baselines while significantly mitigating the limitations inherent to neural decoder
architectures. This work thus positions itself as an alternative to seq2seq models for document
indexing in knowledge bases using keyphrases.