Whisper-RIR-Mega: A Paired Clean-Reverberant Speech Benchmark for ASR Robustness to Room Acoustics
New dataset shows OpenAI's Whisper models suffer a word error rate increase of up to 1.07 percentage points in reverberant rooms.
Researcher Mandip Goswami has introduced Whisper-RIR-Mega, a new benchmark designed to rigorously test the robustness of Automatic Speech Recognition (ASR) systems to real-world room acoustics. The dataset pairs clean speech samples from LibriSpeech with their reverberant counterparts, created by convolving them with real Room Impulse Responses (RIRs) from the extensive RIR-Mega corpus. This structured approach allows for stratified evaluation based on key acoustic metrics like Reverberation Time (RT60) and Direct-to-Reverberant Ratio (DRR). The initial study provides a crucial baseline by evaluating five variants of OpenAI's popular Whisper model family, revealing a consistent, now-quantified performance gap when AI "hears" in echoey environments versus clean audio.
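The pairing described above, convolving clean LibriSpeech audio with a measured RIR, can be sketched as follows. This is a minimal illustration of the general technique, not the dataset's published pipeline: the function name and the RMS level-matching step are assumptions.

```python
import numpy as np

def reverberate(clean: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Convolve a clean waveform with a room impulse response (RIR).

    Convolution smears each sample across the RIR's decay tail,
    simulating how the room's reflections arrive at the microphone.
    Truncating to the clean length keeps the pair sample-aligned.
    """
    wet = np.convolve(clean, rir)[: len(clean)]
    # Rescale so the reverberant copy matches the clean signal's RMS level
    # (an illustrative choice; level-matching conventions vary).
    rms_clean = np.sqrt(np.mean(clean ** 2))
    rms_wet = np.sqrt(np.mean(wet ** 2)) + 1e-12
    return wet * (rms_clean / rms_wet)
```

The reverberant copy stays time-aligned with the clean original, which is what makes a per-sample "reverb penalty" comparison possible.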
The technical evaluation of models from Whisper-tiny to Whisper-large-v3 on 1,600 test samples showed that all models suffer from reverberation, with the degradation measured as a 'reverb penalty' in Word Error Rate (WER) ranging from 0.12 to 1.07 percentage points. This quantifies a significant and previously anecdotal challenge for voice assistants, meeting transcription tools, and other speech AI deployed in offices, cars, or homes. By releasing the full dataset, evaluation code, and baseline results, Goswami provides the community with a reproducible standard to benchmark and, more importantly, improve model robustness. This moves the field beyond clean-lab performance and forces a focus on the messy acoustic realities where these systems are actually used.
- Benchmark tests 5 OpenAI Whisper models (tiny to large-v3) on 1,600 paired clean/reverberant speech samples.
- Uses real Room Impulse Responses from RIR-Mega corpus to simulate authentic acoustic environments like offices and halls.
- Finds a consistent 'reverb penalty' degrading Word Error Rate by 0.12 to 1.07 percentage points across all models.
Why It Matters
Exposes a key weakness in real-world speech AI, pushing development towards models that work reliably in echoey offices, cars, and homes.