Robust LLM-based Audio-Visual Speech Recognition with Sparse Modality Alignment and Visual Unit-Guided Refinement
New AI model reads lips to understand speech in loud rooms, cutting errors by over a third.
A research team led by Fei Su has introduced AVUR-LLM, a Large Language Model (LLM) framework for robust Audio-Visual Speech Recognition (AVSR). The system addresses a key limitation of previous approaches, which either projected audio and visual features into the LLM independently or fused them only shallowly, limiting effective cross-modal learning. AVUR-LLM's core contribution is a dual-method architecture: 'Sparse Modality Alignment', which fuses the audio and visual streams efficiently and selectively, and 'Visual Unit-Guided Refinement', which uses visual information (such as lip movements) to correct errors in the audio-derived transcript. This lets the model stay accurate in conditions where audio-only systems fail.
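The fusion idea can be illustrated with a minimal sketch. The paper's exact mechanism is not described here, so the following is an assumption-laden toy: keep only the top-k visual tokens whose cross-modal affinity with the audio stream is highest, so the LLM consumes a shorter fused sequence. The function name `sparse_modality_alignment`, the scoring rule, and all shapes are illustrative choices, not the authors' implementation.

```python
import numpy as np

def sparse_modality_alignment(audio_tokens, visual_tokens, k=4):
    """Toy sketch of sparse audio-visual fusion (assumed, not the paper's method).

    audio_tokens: (T_a, d) array of audio features.
    visual_tokens: (T_v, d) array of visual (lip) features.
    Returns a fused sequence of length T_a + k instead of T_a + T_v.
    """
    # Cross-modal affinity: dot product of each visual token with each audio token.
    scores = visual_tokens @ audio_tokens.T          # shape (T_v, T_a)
    # Saliency of a visual token = its strongest match anywhere in the audio.
    saliency = scores.max(axis=1)                    # shape (T_v,)
    # Keep only the k most salient visual tokens, preserving temporal order.
    top_k = np.sort(np.argsort(saliency)[-k:])
    sparse_visual = visual_tokens[top_k]             # shape (k, d)
    # Concatenate audio with the sparse visual selection for the LLM.
    return np.concatenate([audio_tokens, sparse_visual], axis=0)

rng = np.random.default_rng(0)
audio = rng.normal(size=(50, 16))
visual = rng.normal(size=(25, 16))
fused = sparse_modality_alignment(audio, visual, k=4)
print(fused.shape)  # (54, 16): 50 audio tokens + 4 selected visual tokens
```

The point of the sketch is the computational argument in the article: the LLM sees 54 tokens rather than 75, which is where the reduced load comes from.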
The technical results are significant: the model achieves state-of-the-art performance on the LRS3 dataset. Most notably, under harsh additive-noise conditions at 0 dB Signal-to-Noise Ratio (SNR), where the noise is as loud as the speech itself (comparable to a loud factory floor), it delivered a 37% relative improvement over the baseline system. The gain comes from the model's ability to weigh visual cues more heavily when the audio is corrupted, rather than simply processing both modalities in parallel. The work, submitted to Interspeech 2026, points toward more reliable voice assistants, transcription services, and communication tools that can function in real-world, noisy settings, moving AI closer to human-like robustness in understanding speech.
- AVUR-LLM uses 'sparse modality alignment' for efficient fusion of audio and visual data, reducing computational load on the LLM.
- Achieved a 37% relative improvement over baseline systems at 0 dB SNR on the LRS3 benchmark.
- The 'visual unit-guided refinement' process uses lip-reading data to correct errors made from noisy audio input.
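The refinement bullet above can be made concrete with a toy sketch. This is a loud assumption: the paper's actual visual units, vocabulary, and scoring are not reproduced here. The idea illustrated is only that discrete viseme-like IDs decoded from the lips can break ties between words that sound alike under noise. The `WORD_TO_VISEMES` table, the unit IDs, and the overlap score are all invented for illustration.

```python
# Hypothetical word -> viseme-ID mapping (invented for this sketch).
# /b/ and /m/ share a bilabial viseme (ID 1); /k/ maps to a distinct one (ID 3).
WORD_TO_VISEMES = {
    "bat": [1, 4, 7],
    "mat": [1, 4, 7],
    "cat": [3, 4, 7],
}

def refine(audio_candidates, lip_units):
    """Pick the audio hypothesis whose viseme sequence best overlaps the
    discrete units decoded from the lip stream (toy position-wise match)."""
    def match(word):
        visemes = WORD_TO_VISEMES[word]
        hits = sum(a == b for a, b in zip(visemes, lip_units))
        return hits / max(len(visemes), 1)
    return max(audio_candidates, key=match)

# Noisy audio cannot separate "bat" from "cat"; the lips show a
# non-bilabial onset (unit 3), so the visual evidence selects "cat".
print(refine(["bat", "cat"], lip_units=[3, 4, 7]))  # cat
```

In the real system the correction is presumably learned end to end inside the LLM rather than done by table lookup; the sketch only conveys why lip-derived units carry information that noisy audio lacks.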
Why It Matters
Enables reliable voice AI in noisy real-world settings such as factories and public spaces, and improves accessibility for users with speech impairments.