DEAF: A Benchmark for Diagnostic Evaluation of Acoustic Faithfulness in Audio Language Models
A new study with 2,700+ audio tests shows AI models are 'deaf' to emotion and background noise.
A team of nine researchers has introduced DEAF (Diagnostic Evaluation of Acoustic Faithfulness), a new benchmark designed to test whether Audio Multimodal Large Language Models (Audio MLLMs) like GPT-4o or Claude 3.5 Sonnet genuinely listen to sound or merely infer answers from text. The benchmark comprises over 2,700 carefully crafted 'conflict stimuli' that pit acoustic information against textual content. For example, a user might ask about a speaker's emotion when the acoustic cues contradict the spoken content, such as a sad voice describing a happy event. The framework systematically tests three core acoustic dimensions: emotional tone (prosody), background sounds, and speaker identity.
The evaluation uses a multi-level framework that progressively increases textual influence, from simple semantic conflicts to actively misleading prompts, in order to isolate 'prompt-induced sycophancy', where the model blindly follows the text. The team introduced diagnostic metrics to quantify a model's reliance on textual cues versus acoustic signals. When applied to seven leading Audio MLLMs, the results were stark: while models showed some sensitivity to acoustic variations, their final predictions were overwhelmingly driven by the text. This reveals a significant 'text dominance' problem, indicating that high scores on standard speech benchmarks may not reflect true acoustic understanding, but rather sophisticated text-based inference.
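To make the idea of a text-versus-audio diagnostic concrete, here is a minimal sketch of how such a metric could be computed on conflict stimuli. This is an illustrative formulation, not DEAF's actual metric: the `ConflictStimulus` structure, the label fields, and the `text_dominance` score are all assumptions for the example.

```python
# Hypothetical sketch: quantifying text dominance on conflict stimuli.
# Each stimulus pairs a label supported by the acoustic signal (e.g. the
# prosody sounds sad) with a conflicting label implied by the text prompt
# (e.g. the transcript describes a happy event). Model predictions are
# scored against both labels.

from dataclasses import dataclass

@dataclass
class ConflictStimulus:
    audio_label: str   # label supported by the acoustic signal
    text_label: str    # conflicting label implied by the text

def text_dominance(stimuli, predictions):
    """Fraction of conflicts resolved in favor of the text minus the
    fraction resolved in favor of the audio. +1.0 means pure
    text-following; -1.0 means pure audio-following."""
    text_follow = sum(p == s.text_label for s, p in zip(stimuli, predictions))
    audio_follow = sum(p == s.audio_label for s, p in zip(stimuli, predictions))
    return (text_follow - audio_follow) / len(stimuli)

# Toy run: the model follows the text on 2 of 3 conflicts.
stimuli = [
    ConflictStimulus(audio_label="sad", text_label="happy"),
    ConflictStimulus(audio_label="angry", text_label="calm"),
    ConflictStimulus(audio_label="sad", text_label="happy"),
]
preds = ["happy", "calm", "sad"]
print(text_dominance(stimuli, preds))  # prints 0.3333333333333333
```

A strongly positive score on stimuli like these would indicate exactly the failure mode the paper reports: the model's answers track the text even when the audio says otherwise.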
- Benchmarks 7 Audio MLLMs with 2,700+ conflict stimuli across emotion, background noise, and speaker ID.
- Reveals a 'text dominance' problem where models ignore sound cues to follow misleading textual prompts.
- Exposes a critical gap between benchmark performance and genuine acoustic understanding in AI.
Why It Matters
For developers, this means current 'audio' AI may fail in real-world scenarios where tone and context are critical, such as customer service or content moderation.