Audio & Speech

MDM-ASR: Bridging Accuracy and Efficiency in ASR with Diffusion-Based Non-Autoregressive Decoding

Researchers combine diffusion models with Transformers to achieve parallel decoding without sacrificing accuracy.

Deep Dive

A research team from Academia Sinica and National Taiwan University has introduced MDM-ASR, a novel framework that applies diffusion models to automatic speech recognition (ASR) to bridge the gap between accuracy and efficiency. The paper, submitted to Interspeech 2026, addresses a fundamental trade-off in sequence-to-sequence Transformer ASR: autoregressive models deliver strong accuracy but suffer from slow sequential decoding, while non-autoregressive models enable parallel processing but typically show degraded performance.

The technical approach combines a pre-trained speech encoder with a Transformer diffusion decoder conditioned on acoustic features and partially masked transcripts. To mitigate the common training-inference mismatch in diffusion models, the researchers developed Iterative Self-Correction Training, which exposes the model to its own intermediate predictions during training. They also designed a position-biased, entropy-bounded, confidence-based sampler to further improve decoding quality.
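The confidence-based sampler can be pictured as a loop that unmasks the most confident positions first, nudged toward earlier positions. The sketch below is illustrative only: the function names, the linear positional bias, and the fixed per-step unmasking budget are assumptions, and the paper's entropy-bounding step is omitted.

```python
import numpy as np

MASK = -1  # sentinel for a still-masked position

def decode(probs_fn, seq_len, steps=4, pos_bias=0.1):
    """Parallel iterative decoding: at each step, unmask the positions the
    model is most confident about, with a linear bias favoring earlier
    positions. (Sketch; the paper's sampler also bounds predictive entropy
    per step, which is omitted here.)"""
    tokens = np.full(seq_len, MASK, dtype=int)
    per_step = int(np.ceil(seq_len / steps))
    for _ in range(steps):
        masked = np.flatnonzero(tokens == MASK)
        if masked.size == 0:
            break
        probs = probs_fn(tokens)             # shape (seq_len, vocab_size)
        conf = probs[masked].max(axis=1)     # confidence = top probability
        conf = conf + pos_bias * (1.0 - masked / seq_len)  # favor earlier slots
        chosen = masked[np.argsort(-conf)[:per_step]]      # unmask top-k
        tokens[chosen] = probs[chosen].argmax(axis=1)
    return tokens
```

With `steps=1` this degenerates to fully parallel one-shot decoding; more steps trade latency for the chance to condition later tokens on earlier commitments, which is where non-autoregressive models usually lose accuracy.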

Experiments across multiple benchmarks demonstrate that MDM-ASR achieves consistent gains over prior non-autoregressive models while maintaining competitive performance with strong autoregressive baselines. Crucially, it retains the parallel decoding efficiency that makes non-autoregressive approaches attractive for real-time applications. This represents a significant step toward practical ASR systems that don't force developers to choose between speed and accuracy, potentially enabling faster transcription services, voice assistants, and real-time captioning with improved reliability.

Key Points
  • Uses Masked Diffusion Models with Transformer architecture for parallel token prediction
  • Introduces Iterative Self-Correction Training to reduce training-inference mismatch
  • Achieves competitive accuracy with autoregressive models while maintaining parallel decoding efficiency
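The Iterative Self-Correction Training idea above can be sketched as a corruption step that feeds the model its own first-pass guesses. Everything here is an assumption for illustration, `predict_fn`, the mask ratio, and the re-masking heuristic are not the paper's recipe.

```python
import numpy as np

MASK = -1  # sentinel for a masked token

def self_correction_batch(targets, predict_fn, mask_ratio=0.5, rng=None):
    """Build a training input that mixes MASK tokens with the model's own
    first-pass predictions, so the second pass trains on inference-like
    inputs instead of only ground-truth context. (Hypothetical sketch.)"""
    rng = np.random.default_rng() if rng is None else rng
    tokens = targets.copy()
    n_mask = max(1, int(mask_ratio * len(tokens)))
    masked = rng.choice(len(tokens), size=n_mask, replace=False)
    tokens[masked] = MASK
    preds = predict_fn(tokens)         # first pass fills the masks
    corrupted = tokens.copy()
    corrupted[masked] = preds[masked]  # expose the model's own guesses
    remask = rng.choice(masked, size=max(1, n_mask // 2), replace=False)
    corrupted[remask] = MASK           # second pass must still predict these
    return corrupted, masked           # loss is taken on `masked` positions
```

The point of the construction is that at inference time the decoder always conditions on its own intermediate outputs, so training on teacher-forced inputs alone creates the very mismatch this scheme is meant to reduce.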

Why It Matters

Enables faster, more accurate speech-to-text for real-time applications like transcription services and voice assistants.