Audio & Speech

Confidence-based decoding makes diffusion ASR as accurate as autoregressive models

arXiv eess.AS May 29, 2026

⚡ Static confidence threshold matches autoregressive accuracy while being significantly faster

Deep Dive

LLM-based automatic speech recognition (ASR) achieves high accuracy but is limited by sequential autoregressive decoding, which is slow. Diffusion Language Models (DLMs) offer a parallel decoding alternative, but their decoding strategies for ASR have remained under-explored. This paper from KAIST presents a systematic evaluation of three decoding schemes: a fixed number of tokens per step, a static confidence threshold, and a dynamic confidence threshold. The authors propose using uncertainty based on negative log-likelihood (NLL) as a proxy for round-wise accuracy, and thus as a gauge of decoding progress.
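To make the NLL-based proxy concrete, here is a minimal sketch. It assumes the uncertainty of a decoding round is the mean negative log-likelihood of the tokens committed in that round; the paper's exact definition is not given in this summary, and the function name `round_nll` is hypothetical.

```python
import math

def round_nll(token_probs):
    """Mean negative log-likelihood of the tokens committed in one round.

    token_probs: probabilities the model assigned to the tokens it committed.
    Assumption: a lower value means a more confident (likely more accurate) round.
    """
    if not token_probs:
        return 0.0
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

confident_round = [0.99, 0.98, 0.97]   # illustrative values, not from the paper
uncertain_round = [0.60, 0.55, 0.40]
print(round_nll(confident_round) < round_nll(uncertain_round))  # True
```

Tracking this value round by round would give a label-free signal of how decoding is progressing, which is the role the summary attributes to the proxy.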

Results show that both confidence-based threshold strategies significantly outperform fixed-number schemes in both accuracy and speed. The key insight: in ASR, most tokens reach high confidence early, allowing reliable ones to be harvested aggressively while only difficult tokens remain for later rounds. Notably, the static-threshold strategy matches the accuracy of autoregressive decoding while offering superior efficiency. This finding opens the door to building faster, parallel ASR systems that do not sacrifice the accuracy of current sequential models.
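The harvesting idea can be sketched as a short loop. This is an illustrative toy, not the authors' implementation: `toy_predict` stands in for a diffusion language model, the confidence values are made up, and the fallback that commits the single most confident token when nothing clears the threshold is an assumption added so the loop always makes progress.

```python
MASK = None  # placeholder for a position that has not been decoded yet

def static_threshold_decode(predict, length, threshold=0.9, max_rounds=50):
    """Commit every masked token whose confidence clears a fixed threshold.

    predict(tokens) must return a (token, confidence) pair for every position.
    Returns the decoded tokens and the number of rounds used.
    """
    tokens = [MASK] * length
    rounds = 0
    while MASK in tokens and rounds < max_rounds:
        rounds += 1
        proposals = predict(tokens)
        masked = [i for i, t in enumerate(tokens) if t is MASK]
        accepted = [i for i in masked if proposals[i][1] >= threshold]
        if not accepted:
            # Assumed fallback: commit the most confident token to avoid stalling.
            accepted = [max(masked, key=lambda i: proposals[i][1])]
        for i in accepted:
            tokens[i] = proposals[i][0]
    return tokens, rounds

# Toy stand-in for a model: most tokens are easy, two are hard, and
# confidence rises as more context gets filled in.
REFERENCE = ["the", "cat", "sat", "on", "the", "mat"]
BASE_CONF = [0.99, 0.95, 0.55, 0.97, 0.98, 0.65]

def toy_predict(tokens):
    filled = sum(t is not MASK for t in tokens)
    return [(w, min(1.0, c + 0.1 * filled)) for w, c in zip(REFERENCE, BASE_CONF)]

decoded, rounds = static_threshold_decode(toy_predict, len(REFERENCE))
print(decoded, rounds)  # six tokens decoded in 2 rounds
```

In this toy run, four easy tokens are committed in the first round and the two hard ones in the second, whereas committing one token per round would take six rounds.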

Key Points

Evaluated three DLM decoding schemes: fixed-number, static confidence threshold, and dynamic confidence threshold
Both threshold-based strategies significantly outperform fixed-number approaches in accuracy and speed
Static-threshold decoding matches autoregressive ASR accuracy while being more efficient

Why It Matters

Enables faster, parallel speech recognition without trading off accuracy, critical for real-time ASR applications.
