Confidence-based decoding makes diffusion ASR as accurate as autoregressive models
Static confidence threshold matches autoregressive accuracy while being significantly faster
LLM-based automatic speech recognition (ASR) achieves high accuracy, but its sequential autoregressive decoding is slow. Diffusion Language Models (DLMs) offer a parallel decoding alternative, yet their decoding strategies for ASR remain under-explored. This paper from KAIST presents a systematic evaluation of three decoding schemes: a fixed number of steps, a static confidence threshold, and a dynamic confidence threshold. The authors also propose a Negative Log-Likelihood (NLL)-based uncertainty measure as a proxy for round-wise accuracy, which lets them gauge how decoding is progressing.
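To make the uncertainty proxy concrete, here is a minimal Python sketch. This summary does not give the paper's exact formulation, so the sketch assumes one plausible reading: the mean negative log-probability of the tokens committed in a single decoding round. The function name `round_nll` and the `eps` clamp are illustrative, not taken from the paper.

```python
import math
from typing import Sequence


def round_nll(committed_probs: Sequence[float], eps: float = 1e-12) -> float:
    """Mean negative log-likelihood of the tokens committed in one round.

    Assumption (not from the paper): each entry is the probability the model
    assigned to the token it committed. Lower values mean the round was
    decoded with higher confidence.
    """
    if not committed_probs:
        raise ValueError("no tokens were committed in this round")
    # Clamp at eps so a zero probability does not produce log(0).
    total = sum(math.log(max(p, eps)) for p in committed_probs)
    return -total / len(committed_probs)


# A confident round scores lower than a hesitant one.
confident_round = round_nll([0.98, 0.95, 0.99])
hesitant_round = round_nll([0.60, 0.55, 0.70])
print(confident_round < hesitant_round)
```

Under this reading, a rising per-round NLL would signal that the decoder has moved on from easy tokens to hard ones.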
Results show that both confidence-based threshold strategies significantly outperform fixed-number schemes in both accuracy and speed. The key insight: in ASR, most tokens reach high confidence early, allowing reliable ones to be harvested aggressively while only difficult tokens remain for later rounds. Notably, the static-threshold strategy matches the accuracy of autoregressive decoding while offering superior efficiency. This finding opens the door to building faster, parallel ASR systems that do not sacrifice the accuracy of current sequential models.
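The static-threshold idea can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the authors' implementation: `predict` stands in for a diffusion model's denoising pass, the toy predictor and its `difficulty` values are invented for the demo, and the fallback of committing the single most confident token when nothing clears the threshold is an assumption added to guarantee progress.

```python
from typing import Callable, List, Optional, Tuple

MASK = None  # marks a position that has not been decoded yet

Prediction = Tuple[int, float]  # (token id, confidence in [0, 1])
Predictor = Callable[[List[Optional[int]]], List[Prediction]]


def static_threshold_decode(
    predict: Predictor, length: int, threshold: float = 0.9, max_rounds: int = 32
) -> Tuple[List[Optional[int]], int]:
    """Commit every masked token whose confidence clears a fixed threshold."""
    seq: List[Optional[int]] = [MASK] * length
    rounds = 0
    while any(t is MASK for t in seq) and rounds < max_rounds:
        preds = predict(seq)  # one parallel pass over all positions
        masked = [i for i, t in enumerate(seq) if t is MASK]
        accept = [i for i in masked if preds[i][1] >= threshold]
        if not accept:
            # Assumed fallback: commit the single most confident position.
            accept = [max(masked, key=lambda i: preds[i][1])]
        for i in accept:
            seq[i] = preds[i][0]
        rounds += 1
    if any(t is MASK for t in seq):
        # Round budget exhausted: fill the remaining positions greedily.
        preds = predict(seq)
        seq = [preds[i][0] if t is MASK else t for i, t in enumerate(seq)]
        rounds += 1
    return seq, rounds


def make_toy_predictor(target: List[int], difficulty: List[float]) -> Predictor:
    """Hypothetical stand-in for a model: confidence grows with revealed context."""

    def predict(seq: List[Optional[int]]) -> List[Prediction]:
        frac_done = sum(t is not MASK for t in seq) / len(seq)
        return [
            (tok, min(1.0, (1.0 - d) + d * frac_done))
            for tok, d in zip(target, difficulty)
        ]

    return predict


# Four easy tokens and two hard ones (values invented for the demo).
target = [7, 3, 9, 4, 1, 8]
difficulty = [0.05, 0.05, 0.05, 0.05, 0.2, 0.5]
decoded, rounds = static_threshold_decode(
    make_toy_predictor(target, difficulty), len(target)
)
print(decoded, rounds)
```

In this toy setup the easy tokens are committed together in the first round and only the hard ones need further passes, so decoding finishes in fewer rounds than one-token-per-step generation would take.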
- Evaluated three DLM decoding schemes: a fixed number of steps, a static confidence threshold, and a dynamic confidence threshold
- Both threshold-based strategies significantly outperform fixed-number approaches in accuracy and speed
- Static-threshold decoding matches autoregressive ASR accuracy while being more efficient
Why It Matters
This work enables faster, parallel speech recognition without sacrificing accuracy, which is critical for real-time ASR applications.