Mega-ASR cuts word error rate by over 30% in complex acoustic environments
New ASR framework achieves 45.69% WER on VOiCES, beating the prior best by 8.3 percentage points.
Traditional ASR systems and large audio-language models struggle in real-world settings because of an "acoustic robustness bottleneck": they fail under severe, compositional distortions such as background noise, reverberation, and overlapping speech. To address this, a team led by Zhifei Xie (Nanyang Technological University and Kunlun Inc.) developed Mega-ASR, a unified framework that combines scalable compound-data construction with a progressive training pipeline.
The team first built Voices-in-the-Wild-2M, a dataset of 2 million samples covering 7 classic acoustic phenomena (e.g., noise, echo, clipping) and 54 physically plausible compound scenarios (e.g., cafe chatter + train rumble). They then trained Mega-ASR using Acoustic-to-Semantic Progressive Supervised Fine-Tuning, which focuses first on acoustic fidelity and then on semantic accuracy, and Dual-Granularity WER-Gated Policy Optimization, which optimizes word error rate directly.
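To make the idea of a "compound scenario" concrete, here is a minimal sketch of chaining two simple distortions, additive noise at a chosen signal-to-noise ratio followed by hard clipping. It is an illustration only, not the Voices-in-the-Wild-2M pipeline: the function names, the 5 dB SNR, the 0.4 clipping limit, and the synthetic sine "speech" are all assumptions made for the example.

```python
import math
import random


def rms(x):
    """Root-mean-square level of a signal given as a list of floats."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def add_noise(speech, noise, snr_db):
    """Scale the noise so the speech-to-noise ratio equals snr_db, then add it."""
    gain = rms(speech) / (rms(noise) * 10 ** (snr_db / 20))
    return [s + gain * n for s, n in zip(speech, noise)]


def clip(signal, limit):
    """Hard-clip every sample to the range [-limit, limit]."""
    return [max(-limit, min(limit, v)) for v in signal]


def compose(signal, steps):
    """Apply a list of distortion functions in order (hypothetical helper)."""
    for step in steps:
        signal = step(signal)
    return signal


random.seed(0)
sr = 16000  # one second at 16 kHz
# Stand-ins for real audio: a 220 Hz tone as "speech", white noise as "background".
speech = [0.5 * math.sin(2 * math.pi * 220 * t / sr) for t in range(sr)]
noise = [random.gauss(0.0, 1.0) for _ in range(sr)]

# Single distortion: noise only.
noisy = add_noise(speech, noise, snr_db=5.0)

# Compound distortion: noise, then clipping applied to the noisy signal.
compound = compose(
    speech,
    [lambda x: add_noise(x, noise, 5.0), lambda x: clip(x, 0.4)],
)
```

The order of the steps matters: clipping after mixing distorts speech and noise together, which is different from clipping the speech and adding noise afterwards. That interaction is part of what makes compound conditions harder than any single distortion.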
The gains are largest where conditions are hardest. On the VOiCES R4-B-F benchmark (real-world far-field recordings), Mega-ASR achieves 45.69% WER versus the prior best of 54.01%, an absolute improvement of 8.32 points. On the NOIZEUS Sta-0 stationary noise test set, it scores 21.49% versus 29.34%. In the most challenging compositional scenarios, which combine multiple distortions, Mega-ASR delivers over 30% relative WER reduction compared with strong open-source (e.g., Whisper) and closed-source (e.g., Google, Azure) baselines. The researchers plan to release the code, models, and dataset, offering a scalable paradigm for robust in-the-wild speech recognition.
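The figures above mix absolute differences (percentage points) with relative reductions (percent of the baseline), which are easy to confuse. The sketch below shows the standard word-level edit-distance definition of WER and then computes both kinds of improvement from the reported numbers. The helper names and the example sentence are illustrative, and this is not the authors' evaluation code.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1] / len(ref)


def absolute_reduction(baseline, new):
    """Difference in WER, in percentage points."""
    return baseline - new


def relative_reduction(baseline, new):
    """Fraction of the baseline WER that was removed."""
    return (baseline - new) / baseline


# One deletion ("the") and one substitution ("lights" -> "light") over 5 words.
example = wer("turn on the kitchen lights", "turn on kitchen light")  # 0.4

# WER figures reported in the article, in percent.
voices_abs = absolute_reduction(54.01, 45.69)   # about 8.32 points
voices_rel = relative_reduction(54.01, 45.69)   # about 15.4% relative
noizeus_rel = relative_reduction(29.34, 21.49)  # about 26.8% relative
```

On these two benchmarks the relative reductions work out to roughly 15% and 27%. The "over 30%" figure refers specifically to the most challenging compositional scenarios.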
- Voices-in-the-Wild-2M dataset includes 2M samples with 7 acoustic phenomena and 54 compound scenario combinations
- Mega-ASR uses Acoustic-to-Semantic Progressive SFT and Dual-Granularity WER-Gated Policy Optimization
- Beats the prior best on VOiCES by 8.3 WER points (45.69% vs. 54.01%) and achieves over 30% relative WER reduction in the most challenging compound-distortion scenarios
Why It Matters
More robust recognition under compound distortions could make speech interfaces dependable in noisy, real-world settings such as smart homes, industrial floors, and crowded public spaces.