Gladia's rolling-buffer ASR routes audio to monolingual models for faster multilingual transcription
A lightweight ASR system using 100M-parameter models rivals cloud APIs on code-switching.
Gladia's new real-time multilingual ASR system aims for accurate transcription across multiple languages without relying on massive, hardware-intensive models. Instead of using a single multilingual model that struggles with mid-conversation language switches, the system routes audio between smaller, specialized monolingual models, each about 100 million parameters. The core innovation is a coordinator that buffers audio and monitors language confidence from SpeechBrain's language identification. When the confidence for a different language crosses a threshold, the coordinator rolls back to the last speech boundary and re-transcribes the buffered audio with the correct model. Users may briefly see incorrect text, but it is corrected quickly. The system uses Zipformer for low-latency streaming transcription and Silero VAD for speech boundaries, and it starts transcribing immediately rather than waiting for full language detection.
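The rollback behavior can be illustrated with a minimal Python sketch. This is not the project's code: `Chunk`, `Coordinator`, `lang_probs`, `is_boundary`, and the 0.8 threshold are all hypothetical names and values chosen for illustration. The monolingual models are plain callables standing in for Zipformer, the per-chunk scores stand in for SpeechBrain's language identification, and the boundary flag stands in for Silero VAD.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Chunk:
    audio: str                    # stand-in for raw audio samples
    lang_probs: Dict[str, float]  # hypothetical language-ID scores for this chunk
    is_boundary: bool             # hypothetical VAD flag: chunk starts a new speech segment


class Coordinator:
    """Toy rolling-buffer router (illustrative assumption, not the real system)."""

    def __init__(self, models: Dict[str, Callable[[str], str]],
                 default_lang: str, threshold: float = 0.8):
        self.models = models
        self.active = default_lang
        self.threshold = threshold
        self.buffer: List[Chunk] = []   # audio since the last speech boundary
        self.pending: List[str] = []    # provisional text for the buffered audio
        self.committed: List[str] = []  # text finalized at earlier boundaries

    def feed(self, chunk: Chunk) -> List[str]:
        if chunk.is_boundary:
            # A new speech segment begins: finalize what we have, reset the buffer.
            self.committed.extend(self.pending)
            self.pending, self.buffer = [], []
        self.buffer.append(chunk)

        lang, conf = max(chunk.lang_probs.items(), key=lambda kv: kv[1])
        if lang != self.active and lang in self.models and conf >= self.threshold:
            # Confident switch: roll back to the last boundary and
            # re-transcribe the whole buffer with the new model.
            self.active = lang
            self.pending = [self.models[lang](c.audio) for c in self.buffer]
        else:
            # No confident switch: transcribe right away with the active model.
            self.pending.append(self.models[self.active](chunk.audio))
        return self.committed + self.pending
```

In this sketch, text produced before a confident switch is only provisional: everything since the last boundary is replaced once the switch is confirmed, which mirrors the brief flash of incorrect text described above.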
On inter-utterance code-switching benchmarks, this approach achieves about 13% Word Error Rate (WER), outperforming every other system tested, including cloud APIs. The known limitation is intra-utterance switching (e.g., mid-sentence Spanglish), where WER degrades to about 41%, though that is still better than open-source alternatives and comes at a fraction of their size. The project is open-source on GitHub with instructions and detailed benchmark results. A pro tip from the researcher: enabling only the languages you expect makes the system lighter and boosts language identification accuracy, especially on heavily accented speech.
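For readers unfamiliar with the metric, WER is conventionally the word-level edit distance (substitutions, deletions, and insertions) divided by the number of words in the reference transcript. The Python sketch below is a generic illustration of that definition, not the project's benchmark script, which may normalize text differently before scoring; the function name `wer` is chosen here for illustration.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, a WER of about 13% corresponds to roughly 13 word errors per 100 reference words.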
- Routes audio between 100M-parameter monolingual models instead of one large multilingual model.
- Achieves ~13% WER on inter-utterance code-switching, beating cloud APIs.
- Open-source on GitHub; built on Zipformer, Silero VAD, and SpeechBrain.
Why It Matters
Enables accurate real-time multilingual transcription on local hardware, reducing cloud dependency and latency.