Audio & Speech

Decoder-only Conformer with Modality-aware Sparse Mixtures of Experts for ASR

This compact architecture could make speech-to-text models both smaller and more accurate.

Deep Dive

Researchers have developed a new decoder-only model for speech recognition that outperforms larger, more complex systems. The 113-million-parameter model uses a novel 'modality-aware' sparse mixture-of-experts design to process speech and text in a single decoder stack. It achieved a 2.8% word error rate on LibriSpeech test-clean, beating a 139-million-parameter baseline (3.2%). On a multilingual Common Voice benchmark, it reduced the average word error rate from 12.2% to 10.6% without specialized adaptation modules.
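To make the idea concrete, here is a minimal sketch of what 'modality-aware' sparse routing could look like: each token carries a modality tag, and the router only considers experts permitted for that modality before activating the top-k. The expert partition, layer shapes, and all names (`ALLOWED`, `moe_forward`, `TOP_K`) are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

# Hedged sketch of modality-aware sparse MoE routing for one token.
# All sizes and the expert partition below are hypothetical.
rng = np.random.default_rng(0)

D, N_EXPERTS, TOP_K = 8, 6, 2
# Assumed partition: experts 0-1 speech-only, 2-3 text-only, 4-5 shared.
ALLOWED = {
    "speech": [0, 1, 4, 5],
    "text":   [2, 3, 4, 5],
}

W_router = rng.normal(size=(D, N_EXPERTS))                      # router projection
experts = [rng.normal(size=(D, D)) for _ in range(N_EXPERTS)]   # toy linear "experts"

def moe_forward(x, modality):
    """Route a single token vector x through the top-k experts allowed for its modality."""
    logits = x @ W_router
    mask = np.full(N_EXPERTS, -np.inf)
    mask[ALLOWED[modality]] = 0.0          # forbid other-modality experts
    logits = logits + mask
    top = np.argsort(logits)[-TOP_K:]      # indices of the top-k permitted experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()               # softmax over the chosen experts only
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top)), top

x = rng.normal(size=D)
y, chosen = moe_forward(x, "speech")
# A speech token can only ever land on speech-only or shared experts.
assert set(chosen) <= set(ALLOWED["speech"])
```

Because only TOP_K of the experts run per token, the layer's compute stays close to a small dense model while total capacity grows with the expert count, which is the usual argument for sparse MoE efficiency.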

Why It Matters

If these results hold up, the efficiency gains could translate into more accurate voice assistants, transcription services, and real-time translation tools.