MonoASR: A Frugal and Unified Multilingual Speech Recognition Model
Abstract
Automatic Speech Recognition (ASR), the conversion of spoken language into text, remains a
major challenge. Recent models, such as Massively Multilingual Speech (MMS), cover hundreds
of languages but require the addition of language-specific adapters, which increases
parameter cost and hinders scalability, especially for low-resource languages. We introduce
MonoASR, a frugal and unified multilingual system that avoids such adapters through a Universal
Language Projection (ULP). ULP pairs a learned language token with the acoustic representations,
enabling a single model with shared parameters to handle different languages. Evaluated
on French (a high-resource language), Arabic, and Kabyle (underrepresented and complex
languages), MonoASR achieves lower word error rates (WER) than MMS, demonstrating its
robustness, generalization ability, and suitability for low-cost multilingual transcription. Code
is available at: https://github.com/ilyesqlm/MonoASR
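For concreteness, the following is a minimal PyTorch sketch of the ULP idea as described above, assuming the learned language token is simply prepended to the sequence of acoustic representations before decoding. The class name, tensor shapes, and language-to-index mapping are illustrative assumptions, not the released implementation (see the repository above for the actual code).

```python
import torch
import torch.nn as nn

class UniversalLanguageProjection(nn.Module):
    """Illustrative sketch: one learned embedding per language acts as a
    language token that conditions a single shared model, instead of
    attaching per-language adapter parameters."""

    def __init__(self, num_languages: int, d_model: int):
        super().__init__()
        # One learned vector per language (the "language token").
        self.lang_embed = nn.Embedding(num_languages, d_model)

    def forward(self, acoustic: torch.Tensor, lang_id: torch.Tensor) -> torch.Tensor:
        # acoustic: (batch, time, d_model) encoder features
        # lang_id:  (batch,) integer language identifiers
        token = self.lang_embed(lang_id).unsqueeze(1)  # (batch, 1, d_model)
        # Prepend the language token to the acoustic sequence.
        return torch.cat([token, acoustic], dim=1)     # (batch, time+1, d_model)

# Hypothetical usage with dummy features and an assumed language indexing.
ulp = UniversalLanguageProjection(num_languages=3, d_model=256)
feats = torch.randn(2, 100, 256)        # dummy encoder output
langs = torch.tensor([0, 2])            # e.g. 0=French, 1=Arabic, 2=Kabyle
out = ulp(feats, langs)                 # shape: (2, 101, 256)
```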