MonoASR: a frugal and unified multilingual speech recognition model
In EGC 2026, vol. RNTI-E-42, pp. 169-180
Abstract
Automatic Speech Recognition (ASR), the conversion of spoken language into text, remains a major challenge. Recent models such as Massively Multilingual Speech (MMS) cover hundreds of languages but require language-specific adapters, which increase parameter cost and hinder scalability, especially for low-resource languages. We introduce MonoASR, a frugal and unified multilingual system that avoids such adapters through a Universal Language Projection (ULP). ULP associates a learned language token with the acoustic representations, allowing a single model with shared parameters to handle different languages. Evaluated on French (a high-resource language) as well as Arabic and Kabyle (underrepresented and complex languages), MonoASR achieves lower word error rates (WER) than MMS, demonstrating its robustness, generalization ability, and suitability for low-cost multilingual transcription. Code is available at: https://github.com/ilyesqlm/MonoASR
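The abstract does not detail how ULP is implemented; one plausible reading, sketched below, is that a learned per-language embedding is prepended to the acoustic feature sequence, so the shared encoder-decoder conditions on the language without any adapter layers. All names and shapes here are illustrative assumptions, not the authors' actual code; NumPy stands in for trained model weights.

```python
import numpy as np

class UniversalLanguageProjection:
    """Illustrative sketch (not the paper's implementation) of a ULP-style
    mechanism: one learned embedding per language is prepended to the
    frame-level acoustic representations, so a single set of model
    parameters can serve every language."""

    def __init__(self, languages, dim, seed=0):
        rng = np.random.default_rng(seed)
        # Random vectors stand in for embeddings learned during training.
        self.tokens = {lang: rng.standard_normal(dim) for lang in languages}

    def __call__(self, acoustic_feats, lang):
        # acoustic_feats: (T, dim) sequence of acoustic frame representations.
        token = self.tokens[lang][None, :]  # (1, dim) language token
        # The conditioned sequence (T+1, dim) is fed to the shared model.
        return np.concatenate([token, acoustic_feats], axis=0)

ulp = UniversalLanguageProjection(["fr", "ar", "kab"], dim=8)
feats = np.zeros((5, 8))           # dummy 5-frame acoustic sequence
out = ulp(feats, "fr")
print(out.shape)                   # (6, 8): language token + 5 frames
```

Because the language identity is carried by a single token rather than per-language adapter weights, adding a language costs only one extra embedding vector, which is consistent with the paper's frugality claim.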