Audio-Conditioned Diffusion LLMs for ASR and Deliberation Processing
A new AI model combines OpenAI's Whisper with a diffusion LLM to achieve a 2.25% word error rate on clean speech.
A research team from the University of Cambridge and Tsinghua University has published a new paper, 'Audio-Conditioned Diffusion LLMs for ASR and Deliberation Processing,' presenting Whisper-LLaDA. The paper explores using a diffusion-based large language model (LLaDA) as an external deliberation module that refines transcripts produced by a baseline Whisper-LLaMA system built on OpenAI's Whisper. The core innovation is applying the non-autoregressive, denoising capabilities of diffusion models to automatic speech recognition (ASR), a domain traditionally dominated by autoregressive decoders. The work was accepted to ICASSP 2026, and all code has been open-sourced.
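In outline, the deliberation cascade works like this: Whisper-LLaMA produces a first-pass transcript, a fraction of its tokens are masked, and an audio-conditioned diffusion LLM re-predicts the masked positions in parallel. The PyTorch sketch below illustrates that loop; every module, dimension, and name here (AudioConditionedDenoiser, deliberate, MASK_ID, the mask ratio and step count) is a simplified stand-in for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

# Illustrative sizes only; the paper's actual Whisper encoder and LLaDA
# weights and dimensions are not reproduced here.
AUDIO_DIM, TEXT_DIM, VOCAB = 384, 512, 1000
MASK_ID = 0  # id of the special [MASK] token (assumed)

class AudioConditionedDenoiser(nn.Module):
    """Toy masked-denoising step: predicts tokens at masked positions,
    attending to acoustic embeddings via cross-attention."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, TEXT_DIM)
        self.audio_proj = nn.Linear(AUDIO_DIM, TEXT_DIM)
        self.cross_attn = nn.MultiheadAttention(TEXT_DIM, 8, batch_first=True)
        self.out = nn.Linear(TEXT_DIM, VOCAB)

    def forward(self, tokens, audio_emb):
        x = self.embed(tokens)
        a = self.audio_proj(audio_emb)
        x, _ = self.cross_attn(x, a, a)   # condition text on audio frames
        return self.out(x)                # per-position vocabulary logits

def deliberate(first_pass, audio_emb, denoiser, mask_ratio=0.3, steps=4):
    """Refine a first-pass hypothesis: mask a fraction of its tokens,
    then iteratively re-predict them in parallel (non-autoregressive)."""
    tokens = first_pass.clone()
    masked = torch.rand_like(tokens, dtype=torch.float) < mask_ratio
    tokens[masked] = MASK_ID
    for _ in range(steps):
        logits = denoiser(tokens, audio_emb)
        tokens[masked] = logits.argmax(-1)[masked]  # fill all masks at once
    return tokens

# Usage with random stand-in data.
audio_emb = torch.randn(1, 200, AUDIO_DIM)      # Whisper-style acoustic frames
first_pass = torch.randint(1, VOCAB, (1, 40))   # first-pass transcript tokens
refined = deliberate(first_pass, audio_emb, AudioConditionedDenoiser())
print(refined.shape)  # torch.Size([1, 40])
```

The cross-attention on acoustic embeddings is what distinguishes this from plain text post-editing, which matches the paper's finding that a text-only LLaDA did not help.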
The study's key finding is that Whisper-LLaDA, when used as a 'deliberation' processor, significantly improves accuracy. On the LibriSpeech benchmark, the best cascade system achieved a 2.25%/4.94% word error rate (WER) on the test-clean/test-other splits, representing a 12.3% relative improvement over the baseline Whisper-LLaMA on the more challenging test-other set. Critically, the researchers found that conditioning the diffusion model on acoustic embeddings was essential; a plain-text LLaDA failed to improve results. While slightly less accurate as a standalone ASR decoder, Whisper-LLaDA offered faster inference than the baseline in most configurations, pointing to a promising trade-off between speed and accuracy for future speech AI systems.
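Since every headline number here is a WER or a relative WER reduction, a short sketch of how those figures are computed may help. The functions below are standard, but the ~5.63% baseline in the example is back-derived from the quoted 4.94% and 12.3% figures rather than taken from the paper.

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference
    length, via standard Levenshtein alignment over words."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i          # deleting all reference words
    for j in range(len(h) + 1):
        d[0][j] = j          # inserting all hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

def relative_improvement(baseline: float, improved: float) -> float:
    """The 'relative improvement' figure quoted in the article."""
    return (baseline - improved) / baseline

print(wer("the cat sat on the mat", "the cat sat on a mat"))  # 1/6 ~ 0.167
# Baseline back-derived from the quoted numbers, not reported here.
print(f"{relative_improvement(0.0563, 0.0494):.1%}")          # -> 12.3%
```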
- Achieves 2.25%/4.94% WER on LibriSpeech test-clean/test-other, a 12.3% relative improvement over Whisper-LLaMA on test-other.
- Shows that audio-conditioned embeddings are critical; a text-only diffusion LLM failed to improve accuracy.
- Demonstrates faster inference than the baseline in most configurations when used as a standalone ASR decoder, at a small cost in accuracy (see the step-count sketch after this list).
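The speed advantage comes from decoding shape rather than model size: an autoregressive decoder needs one forward pass per emitted token, while a masked-diffusion decoder fills every position in parallel across a fixed number of denoising steps. A back-of-the-envelope comparison, with all values illustrative assumptions rather than figures from the paper:

```python
# Forward-pass counts for decoding a transcript of L tokens.
# Both numbers below are illustrative assumptions, not measured values.
L = 40                 # transcript length in tokens
diffusion_steps = 8    # denoising iterations for the diffusion decoder

autoregressive_passes = L           # one pass per token, strictly sequential
diffusion_passes = diffusion_steps  # each pass predicts all L positions

print(f"autoregressive: {autoregressive_passes} sequential passes")
print(f"diffusion:      {diffusion_passes} parallel passes")
```

Each diffusion pass is wider, since it scores all positions at once, so the wall-clock win depends on batch shape and hardware, consistent with the article's "most configurations" caveat.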
Why It Matters
This research shows that a non-autoregressive diffusion LLM, when conditioned on audio, can both refine ASR transcripts and decode them quickly, pointing toward faster, more accurate real-time transcription and voice assistants.