LRS-VoxMM: A benchmark for in-the-wild audio-visual speech recognition
Harder than LRS3, with distorted audio tests to stress visual cues...
A team of researchers (Doyeop Kwak, Jeongsoo Choi, Suyeon Lee, Joon Son Chung) released LRS-VoxMM, an in-the-wild benchmark for audio-visual speech recognition (AVSR). The benchmark is built from VoxMM, a dataset of diverse real-world spoken conversations with human-annotated transcriptions. The team selected AVSR-suitable samples and preprocessed them into an LRS-style format for direct compatibility with existing AVSR pipelines. Unlike widely used benchmarks such as LRS3, LRS-VoxMM covers a more diverse range of scenarios and acoustic conditions. To stress-test models under severe degradation, the benchmark also includes distorted evaluation sets with additive noise, reverberation, and bandwidth limitation.
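The announcement does not detail the exact distortion pipeline, but the three corruption types it names are standard audio perturbations. As a rough, hypothetical sketch (function names and parameter values are illustrative, not taken from the release), they could be simulated on 16 kHz waveforms roughly like this:

```python
import numpy as np
from scipy.signal import fftconvolve, resample_poly

def add_noise(speech, noise, snr_db):
    """Mix additive noise into speech at a target SNR (in dB)."""
    # Tile or trim the noise sample to match the speech length.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

def add_reverb(speech, rir):
    """Convolve speech with a room impulse response to add reverberation."""
    wet = fftconvolve(speech, rir, mode="full")[: len(speech)]
    # Rescale so the reverberant signal keeps roughly the original peak level.
    return wet * (np.max(np.abs(speech)) / (np.max(np.abs(wet)) + 1e-12))

def bandlimit(speech, sr=16000, low_sr=4000):
    """Simulate bandwidth limitation by down- and up-sampling the waveform."""
    narrow = resample_poly(speech, low_sr, sr)
    return resample_poly(narrow, sr, low_sr)[: len(speech)]
```

In a setup like this, a clean test utterance would pass through one of the three functions before evaluation, so word error rates can be compared across matched clean and degraded conditions.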
Experimental results confirm that LRS-VoxMM is considerably harder than LRS3. Critically, the contribution of visual information (e.g., lip movements) becomes more evident as the audio signal degrades, underscoring the value of multimodal fusion in challenging acoustic environments. The benchmark aims to push AVSR research toward more realistic conditions, encouraging the development of robust models that leverage visual cues when audio is noisy or band-limited. The dataset and code are available on the project page.
- Derived from VoxMM real-world conversations with human-annotated transcripts in LRS-style format
- Includes distorted evaluation sets with additive noise, reverberation, and bandwidth limitation
- Considerably harder than LRS3; the contribution of visual information increases significantly as audio degrades
Why It Matters
Real-world AVSR testing under acoustic stress pushes lip-reading and multimodal models to new limits.