Audio & Speech

Mega-ASR cuts word error rate by over 30% (relative) in complex acoustic environments

arXiv eess.AS May 20, 2026

⚡ New ASR framework achieves 45.69% WER on VOiCES, beating the prior best by 8.3 points.

Deep Dive

Traditional ASR systems and large audio-language models struggle in real-world settings because of an "acoustic robustness bottleneck": they fail under severe, compositional distortions such as background noise, reverberation, and overlapping speech. To address this, a team led by Zhifei Xie (Nanyang Technological University and Kunlun Inc.) developed Mega-ASR, a unified framework that combines scalable compound-data construction with a progressive training pipeline. The team first built Voices-in-the-Wild-2M, a dataset of 2 million samples covering 7 classic acoustic phenomena (e.g., noise, echo, clipping) and 54 physically plausible compound scenarios (e.g., cafe chatter + train rumble). They then trained Mega-ASR with Acoustic-to-Semantic Progressive Supervised Fine-Tuning (first focusing on acoustic fidelity, then on semantic accuracy) and Dual-Granularity WER-Gated Policy Optimization, which directly optimizes word error rate (WER).
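Because the training pipeline targets word error rate directly, it helps to recall how that metric is usually computed. The following is a minimal sketch of the standard definition (word-level edit distance divided by the number of reference words), not code from Mega-ASR; the function name `wer` and the whitespace tokenization are assumptions made for illustration.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref = reference.split()  # assumption: simple whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word ("the") and one substituted word ("lights" -> "light").
    print(wer("turn on the kitchen lights", "turn on kitchen light"))
```

Note that WER can exceed 100% when the hypothesis contains many inserted words, which is one reason heavily distorted audio can produce very high scores.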

Mega-ASR's results are striking. On the VOiCES R4-B-F benchmark (real-world far-field recordings), it achieves 45.69% WER versus the prior best of 54.01%, an absolute improvement of 8.32 points. On the NOIZEUS Sta-0 stationary-noise test set, it scores 21.49% WER versus 29.34%. In the most challenging compositional scenarios, which combine multiple distortions, Mega-ASR delivers over 30% relative WER reduction compared with strong open-source (e.g., Whisper) and closed-source (e.g., Google, Azure) baselines. The researchers plan to release the code, models, and dataset, offering a scalable paradigm for robust in-the-wild speech recognition.
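The figures above mix absolute differences (WER points) with relative reductions (percent of the baseline), which are easy to confuse. As a small worked example, the sketch below computes both from the two benchmark pairs quoted in this article; the helper names are illustrative, and the second number in each pair is simply the comparison figure reported here.

```python
def absolute_reduction(baseline_wer: float, new_wer: float) -> float:
    """Difference in WER percentage points."""
    return baseline_wer - new_wer


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline WER that was removed."""
    return (baseline_wer - new_wer) / baseline_wer


# (comparison WER %, Mega-ASR WER %) as quoted in the article
results = {
    "VOiCES R4-B-F": (54.01, 45.69),
    "NOIZEUS Sta-0": (29.34, 21.49),
}

if __name__ == "__main__":
    for name, (baseline, new) in results.items():
        abs_pts = absolute_reduction(baseline, new)
        rel_pct = 100 * relative_reduction(baseline, new)
        print(f"{name}: {abs_pts:.2f} points absolute, {rel_pct:.1f}% relative")
```

On these two benchmarks the relative reductions work out to roughly 15.4% and 26.8%, so the "over 30%" figure refers to the compound-distortion scenarios rather than to VOiCES or NOIZEUS.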

Key Points

Voices-in-the-Wild-2M dataset includes 2M samples with 7 acoustic phenomena and 54 compound scenario combinations
Mega-ASR uses Acoustic-to-Semantic Progressive SFT and Dual-Granularity WER-Gated Policy Optimization
Outperforms the prior state of the art by 8.3 WER points on VOiCES and achieves over 30% relative WER reduction on complex compound scenarios

Why It Matters

Enables reliable speech recognition in noisy, real-world settings like smart homes, industrial floors, or crowded public spaces.
