Routes audio between 100M parameter monolingual models instead of one large multilingual model?

Routes audio between 100M parameter monolingual models instead of one large multilingual model.

Achieves ~13% WER on inter-utterance code-switching, beating cloud APIs?

Achieves ~13% WER on inter-utterance code-switching, beating cloud APIs.

Open-source on GitHub, uses Zipformer, Silero VAD, and SpeechBrain?

Open-source on GitHub, uses Zipformer, Silero VAD, and SpeechBrain.

Research & Papers

Gladia's rolling buffer ASR routes audio to monolingual models for faster multilingual transcription

r/MachineLearning June 01, 2026

⚡A lightweight ASR system using 100M parameter models rivals cloud APIs on code-switching.

Deep Dive

Gladia's new real-time multilingual ASR system tackles the challenge of accurate transcription across multiple languages without relying on massive, hardware-intensive models. Instead of using a single multilingual model that struggles with mid-conversation language switches, the system routes audio between smaller, specialized monolingual models, each about 100 million parameters. The core innovation is a coordinator that buffers audio, monitors language confidence from SpeechBrain's language identification, and when a switch is detected above a threshold, it rolls back to the last speech boundary and re-transcribes with the correct model. Users may briefly see incorrect text, but it self-corrects quickly. The system uses Zipformer for low-latency streaming transcription and Silero VAD for speech boundaries, starting transcription immediately without waiting for full language detection.

On inter-utterance code-switching benchmarks, this approach achieves about 13% Word Error Rate (WER), outperforming every other system tested, including cloud APIs. The known limitation is intra-utterance switching (e.g., mid-sentence Spanglish), which degrades to about 41% WER, though still better than open-source alternatives and at a fraction of the size. The project is open-source on GitHub with instructions and detailed benchmark results. A pro tip from the researcher: enabling only expected languages makes the system lighter and boosts language identification accuracy, especially on heavily accented speech.

Key Points

Routes audio between 100M parameter monolingual models instead of one large multilingual model.
Achieves ~13% WER on inter-utterance code-switching, beating cloud APIs.
Open-source on GitHub, uses Zipformer, Silero VAD, and SpeechBrain.

Why It Matters

Enables accurate real-time multilingual transcription on local hardware, reducing cloud dependency and latency.

Read Original Article

Gladia's rolling buffer ASR routes audio to monolingual models for faster multilingual transcription

Why It Matters

Related Articles

🚀 Stay Ahead in AI