Audio & Speech

Diffusion LLM achieves 19.5% word error rate in visual speech recognition

First-ever diffusion LLM for lip-reading beats prior SOTA on LRS3 benchmark

Deep Dive

Traditional Visual Speech Recognition (VSR) systems rely on left-to-right autoregressive decoding, which forces premature decisions on visually ambiguous tokens before sufficient context is available. Researchers propose DLLM-VSR, the first Diffusion Large Language Model (DLLM) framework for VSR. It formulates transcription as iterative masked denoising with flexible-order decoding. Through confidence-based unmasking, the model commits high-confidence positions early and uses those tokens as bidirectional context to refine ambiguous ones, overcoming the limitations of sequential decoding.
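The confidence-based unmasking loop described above can be illustrated with a minimal sketch. This is not the DLLM-VSR implementation: the names `confidence_unmask` and `toy_predict`, the example sentence, and all confidence values are hypothetical, and a hand-written toy predictor stands in for the real model. It only shows how committing high-confidence positions first lets later steps use them as context for an ambiguous one.

```python
from typing import Callable, List, Optional, Tuple

# A predictor returns one (token, confidence) proposal per position,
# given the current partially unmasked sequence (None = still masked).
Predictor = Callable[[List[Optional[str]]], List[Tuple[str, float]]]


def confidence_unmask(
    length: int, predict: Predictor, per_step: int = 1
) -> Tuple[List[Optional[str]], List[int]]:
    """Fill a fully masked sequence, committing the most confident positions first."""
    tokens: List[Optional[str]] = [None] * length
    order: List[int] = []  # positions in the order they were committed
    while None in tokens:
        proposals = predict(tokens)
        masked = [i for i, t in enumerate(tokens) if t is None]
        # Rank the still-masked positions by the predictor's confidence.
        masked.sort(key=lambda i: proposals[i][1], reverse=True)
        for i in masked[:per_step]:
            tokens[i] = proposals[i][0]
            order.append(i)
    return tokens, order


def toy_predict(tokens: List[Optional[str]]) -> List[Tuple[str, float]]:
    """Hypothetical stand-in for the model; all values are made up."""
    # Position 1 is visually ambiguous: "bat", "mat" and "pat" look alike on the lips.
    out = [("the", 0.9), ("mat", 0.4), ("flew", 0.8), ("away", 0.95)]
    if tokens[2] == "flew":
        # Once the neighbouring word is committed, context resolves the ambiguity.
        out[1] = ("bat", 0.85)
    return out


decoded, commit_order = confidence_unmask(4, toy_predict)
print(decoded)       # ['the', 'bat', 'flew', 'away']
print(commit_order)  # [3, 0, 2, 1]
```

In this toy run the ambiguous position is filled last, after the word to its right has been committed; a strict left-to-right decoder would have had to choose it second, before that context existed.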

To adapt DLLMs to VSR, the team introduces a two-stage masked-denoising training strategy that separates visual-to-text content alignment from length modeling. They also observe a performance gap relative to oracle-length decoding, in which the true transcript length is given, and develop length-guided candidate decoding to reduce target-length uncertainty. This technique uses video duration to construct plausible transcript-length hypotheses, decodes under multiple hypotheses, and reranks the candidates using length plausibility and decoding confidence. The result is a reported state-of-the-art 19.5% Word Error Rate (WER) on LRS3, achieved using only the benchmark's labeled training data.
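The three steps of length-guided candidate decoding (build length hypotheses from duration, decode under each, rerank) can be sketched as follows. Everything specific here is an assumption made for illustration: the speaking rate of 3 tokens per second, the Gaussian-shaped length prior, the additive score, and the names `length_hypotheses`, `length_log_prior`, `length_guided_decode`, and `toy_decode` do not come from the source, which does not state how plausibility and confidence are combined.

```python
from typing import Callable, List, Tuple

# A decoder takes a target length and returns (tokens, mean log-confidence).
Decoder = Callable[[int], Tuple[List[str], float]]


def length_hypotheses(
    duration_s: float, tokens_per_s: float = 3.0, spread: int = 2
) -> List[int]:
    """Candidate transcript lengths centred on duration * assumed speaking rate."""
    center = max(1, round(duration_s * tokens_per_s))
    return [n for n in range(center - spread, center + spread + 1) if n >= 1]


def length_log_prior(
    n: int, duration_s: float, tokens_per_s: float = 3.0, sigma: float = 2.0
) -> float:
    """Assumed Gaussian-shaped log-plausibility of length n for this duration."""
    expected = duration_s * tokens_per_s
    return -0.5 * ((n - expected) / sigma) ** 2


def length_guided_decode(
    duration_s: float, decode: Decoder, weight: float = 1.0
) -> List[str]:
    """Decode once per length hypothesis and keep the best-scoring candidate."""
    best_score = float("-inf")
    best_tokens: List[str] = []
    for n in length_hypotheses(duration_s):
        tokens, confidence = decode(n)
        # Assumed scoring rule: decoding confidence plus weighted length plausibility.
        score = confidence + weight * length_log_prior(n, duration_s)
        if score > best_score:
            best_score, best_tokens = score, tokens
    return best_tokens


def toy_decode(n: int) -> Tuple[List[str], float]:
    """Hypothetical decoder that is most confident at length 5."""
    return ["tok"] * n, (-0.1 if n == 5 else -1.0)


result = length_guided_decode(2.0, toy_decode)
print(len(result))  # 5
```

With a 2-second clip the duration alone suggests 6 tokens, but the toy decoder is far more confident at 5, so the combined score selects the 5-token candidate. The `weight` parameter controls how strongly the duration-based prior can override decoding confidence.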

Key Points
  • First Diffusion Large Language Model (DLLM) applied to visual speech recognition, using iterative masked denoising instead of autoregressive decoding
  • Achieves state-of-the-art 19.5% Word Error Rate on LRS3 benchmark using only labeled training data
  • Novel length-guided candidate decoding reduces target-length uncertainty by constructing length hypotheses from video duration and reranking candidates by length plausibility and decoding confidence

Why It Matters

Flexible-order decoding enables more accurate lip-reading from silent video, which could advance accessibility tools and surveillance applications.