Audio & Speech

Diffusion LLM achieves 19.5% word error rate in visual speech recognition

arXiv eess.AS May 28, 2026

⚡ First-ever diffusion LLM for lip-reading beats prior SOTA on LRS3 benchmark

Deep Dive

Traditional Visual Speech Recognition (VSR) systems rely on left-to-right autoregressive decoding, which forces premature decisions on visually ambiguous tokens before sufficient context is available. Researchers propose DLLM-VSR, the first Diffusion Large Language Model (DLLM) framework for VSR. It formulates transcription as iterative masked denoising with flexible-order decoding. Through confidence-based unmasking, the model commits high-confidence positions early and uses those tokens as bidirectional context to refine ambiguous ones, overcoming the limitations of sequential decoding.
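As a rough illustration of confidence-based unmasking, the loop below starts from a fully masked transcript and, at each step, commits only the positions the predictor is most sure about, so later steps can condition on them from both sides. This is a minimal sketch, not the DLLM-VSR implementation: `confidence_unmask`, `toy_predict`, the even-share commit schedule, and the confidence values are all assumptions made for the example, and the toy predictor stands in for a real video-conditioned model.

```python
from typing import Callable, List, Optional, Tuple

MASK = None  # marks a position that has not been committed yet
Prediction = Tuple[str, float]  # (proposed token, confidence in [0, 1])


def confidence_unmask(
    predict: Callable[[List[Optional[str]]], List[Prediction]],
    length: int,
    steps: int,
) -> List[Optional[str]]:
    """Fill `length` masked positions over at most `steps` denoising steps."""
    tokens: List[Optional[str]] = [MASK] * length
    for step in range(steps):
        masked = [i for i, t in enumerate(tokens) if t is MASK]
        if not masked:
            break
        preds = predict(tokens)
        # Assumed schedule: commit an even share of what is left, so the
        # final step always fills every remaining position.
        k = -(-len(masked) // (steps - step))  # ceiling division
        most_confident = sorted(masked, key=lambda i: preds[i][1], reverse=True)[:k]
        for i in most_confident:
            tokens[i] = preds[i][0]
    return tokens


# Hypothetical stand-in for the model: it always proposes the right word,
# but grows more confident once a neighbouring position has been revealed.
TARGET = ["see", "you", "at", "the", "bay"]
BASE_CONF = [0.9, 0.6, 0.3, 0.8, 0.2]


def toy_predict(tokens: List[Optional[str]]) -> List[Prediction]:
    preds = []
    for i, word in enumerate(TARGET):
        neighbours = [j for j in (i - 1, i + 1) if 0 <= j < len(TARGET)]
        revealed = sum(tokens[j] is not MASK for j in neighbours)
        preds.append((word, min(1.0, BASE_CONF[i] + 0.3 * revealed)))
    return preds


decoded = confidence_unmask(toy_predict, length=len(TARGET), steps=3)
print(decoded)
```

In this toy run the easy words are committed first, and the visually ambiguous positions are filled in only after their neighbours are known, which is the behaviour a left-to-right decoder cannot reproduce.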

To adapt DLLMs to VSR, the team introduces a two-stage masked-denoising training strategy that separates visual-to-text content alignment from length modeling. They also observe that decoding falls short of the accuracy it reaches when given the oracle (ground-truth) transcript length, and develop length-guided candidate decoding to reduce this target-length uncertainty. The technique uses video duration to construct plausible transcript-length hypotheses, decodes under each hypothesis, and reranks the results by length plausibility and decoding confidence. The result is a state-of-the-art 19.5% Word Error Rate on LRS3 using only the benchmark's labeled training data.
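The length-guided step can be pictured as follows: derive a handful of candidate transcript lengths from the video duration, decode once per candidate, and keep the hypothesis with the best combined score. This is a sketch under stated assumptions rather than the paper's method: the speaking rate `tokens_per_s`, the Gaussian-shaped length score, the weight `alpha`, and the `toy_decode` function are all hypothetical choices made for illustration.

```python
from typing import Callable, List, Tuple


def length_candidates(duration_s: float, tokens_per_s: float = 3.0,
                      spread: int = 2, min_len: int = 1) -> List[int]:
    """Candidate lengths centred on an assumed tokens-per-second rate."""
    centre = max(min_len, round(duration_s * tokens_per_s))
    return [n for n in range(centre - spread, centre + spread + 1) if n >= min_len]


def length_log_plausibility(n: int, duration_s: float,
                            tokens_per_s: float = 3.0, sigma: float = 2.0) -> float:
    """Unnormalised Gaussian log-score: 0 at the expected length, lower elsewhere."""
    expected = duration_s * tokens_per_s
    return -0.5 * ((n - expected) / sigma) ** 2


def decode_with_length_guidance(
    decode: Callable[[int], Tuple[List[str], float]],
    duration_s: float,
    alpha: float = 1.0,
) -> Tuple[List[str], float]:
    """Decode under each length hypothesis and return the top-scoring one."""
    best_tokens: List[str] = []
    best_score = float("-inf")
    for n in length_candidates(duration_s):
        tokens, mean_log_conf = decode(n)
        score = mean_log_conf + alpha * length_log_plausibility(n, duration_s)
        if score > best_score:
            best_tokens, best_score = tokens, score
    return best_tokens, best_score


# Hypothetical decoder: most confident when the requested length matches the
# true five-word transcript, less confident the further off the length is.
TRUE_TRANSCRIPT = ["see", "you", "at", "the", "bay"]


def toy_decode(n: int) -> Tuple[List[str], float]:
    padded = (TRUE_TRANSCRIPT + ["<pad>"] * n)[:n]
    mean_log_conf = -0.1 - 0.5 * abs(n - len(TRUE_TRANSCRIPT))
    return padded, mean_log_conf


best_tokens, best_score = decode_with_length_guidance(toy_decode, duration_s=2.0)
print(best_tokens, best_score)
```

Here the duration alone suggests six tokens, but the five-token hypothesis wins the rerank because its decoding confidence outweighs the small length penalty, which is why the two signals are combined instead of trusting the duration estimate on its own.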

Key Points

First Diffusion Large Language Model (DLLM) applied to visual speech recognition, using iterative masked denoising instead of autoregressive decoding
Achieves state-of-the-art 19.5% Word Error Rate on LRS3 benchmark using only labeled training data
Novel length-guided candidate decoding reduces target-length uncertainty by constructing hypotheses from video duration and reranking by confidence

Why It Matters

Flexible-order decoding enables more accurate lip-reading from silent video, which could advance accessibility tools and surveillance applications.
