LRS-VoxMM: A benchmark for in-the-wild audio-visual speech recognition
Harder than LRS3, with distorted audio tests to stress visual cues...
A team of researchers (Doyeop Kwak, Jeongsoo Choi, Suyeon Lee, Joon Son Chung) released LRS-VoxMM, an in-the-wild benchmark for audio-visual speech recognition (AVSR). The benchmark is built from VoxMM, a dataset of diverse real-world spoken conversations with human-annotated transcriptions. The team selected AVSR-suitable samples and preprocessed them into an LRS-style format for direct compatibility with existing AVSR pipelines. Unlike widely used benchmarks such as LRS3, LRS-VoxMM covers a more diverse range of scenarios and acoustic conditions. To stress-test models under severe degradation, the benchmark also includes distorted evaluation sets with additive noise, reverberation, and bandwidth limitation.
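The announcement does not detail the exact distortion pipeline, but the three corruption types it names are standard audio perturbations. As a rough, hypothetical sketch (function names and parameter values are illustrative, not taken from the release), they could be simulated on 16 kHz waveforms roughly like this:

```python
import numpy as np
from scipy.signal import fftconvolve, resample_poly

def add_noise(speech, noise, snr_db):
    """Mix additive noise into speech at a target SNR (in dB)."""
    # Tile or trim the noise sample to match the speech length.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

def add_reverb(speech, rir):
    """Convolve speech with a room impulse response to add reverberation."""
    wet = fftconvolve(speech, rir, mode="full")[: len(speech)]
    # Rescale so the reverberant signal keeps roughly the original peak level.
    return wet * (np.max(np.abs(speech)) / (np.max(np.abs(wet)) + 1e-12))

def bandlimit(speech, sr=16000, low_sr=4000):
    """Simulate bandwidth limitation by down- and up-sampling the waveform."""
    narrow = resample_poly(speech, low_sr, sr)
    return resample_poly(narrow, sr, low_sr)[: len(speech)]
```

In a setup like this, a clean test utterance would pass through one of the three functions before evaluation, so word error rates can be compared across matched clean and degraded conditions.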
Experimental results confirm that LRS-VoxMM is considerably harder than LRS3. Critically, the contribution of visual information (e.g., lip movements) becomes more evident as the audio signal degrades, underscoring the value of multimodal fusion in challenging acoustic environments. The benchmark aims to push AVSR research toward more realistic conditions, encouraging the development of robust models that leverage visual cues when audio is noisy or band-limited. The dataset and code are available on the project page.
- Derived from VoxMM real-world conversations with human-annotated transcripts in LRS-style format
- Includes distorted evaluation sets with additive noise, reverberation, and bandwidth limitation
- Considerably harder than LRS3; the contribution of visual information increases significantly as audio degrades
Why It Matters
Real-world AVSR testing under acoustic stress pushes lip-reading and multimodal models to new limits.