Audio & Speech

When Audio-LLMs Don't Listen: A Cross-Linguistic Study of Modality Arbitration

Major flaw discovered: when speech and text conflict, AI models ignore what you say and follow the text instead.

Deep Dive

A new study reveals that speech-enabled LLMs overwhelmingly trust conflicting text over audio, even when explicitly instructed to listen. When the two modalities conflict, models such as Gemini 2.0 Flash follow the text roughly 10 times more often than the audio (16.6% vs. 1.6%). This 'text dominance' persists even though audio embeddings preserve the spoken content more accurately (97.2%) than text transcripts do (93.9%). The bias holds across 8 languages and 4 state-of-the-art models, exposing a critical reliability flaw.
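The measurement behind those percentages can be sketched as a simple conflict tally: pair each audio clip with a deliberately mismatched transcript, then classify which source the model's reply follows. The snippet below is an illustrative sketch, not the paper's actual evaluation harness; the helper names and toy data are assumptions, and a real setup would call a speech-enabled model rather than use canned replies.

```python
def classify_response(response: str, audio_answer: str, text_answer: str) -> str:
    """Label a reply as following the audio, the text, or neither.

    A naive substring check stands in for whatever answer-matching
    the study actually used (assumption).
    """
    if audio_answer.lower() in response.lower():
        return "audio"
    if text_answer.lower() in response.lower():
        return "text"
    return "neither"


def follow_rates(trials):
    """Fraction of conflict trials resolved in favor of each modality."""
    counts = {"audio": 0, "text": 0, "neither": 0}
    for response, audio_answer, text_answer in trials:
        counts[classify_response(response, audio_answer, text_answer)] += 1
    total = len(trials)
    return {label: n / total for label, n in counts.items()}


# Toy conflicting trials: (model reply, answer spoken in the audio,
# answer written in the text). In the real study the replies would
# come from models like Gemini 2.0 Flash.
trials = [
    ("The meeting is on Tuesday.", "Monday", "Tuesday"),  # follows text
    ("It starts at 3 pm.", "3 pm", "4 pm"),               # follows audio
    ("The meeting is on Friday.", "Monday", "Friday"),    # follows text
]

print(follow_rates(trials))
```

A text-dominance finding like the study's corresponds to the "text" rate dwarfing the "audio" rate across many such trials.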

Why It Matters

This undermines trust in voice assistants and audio AI: systems that appear to listen may in fact be answering from text alone.