Audio LLMs Follow Conflicting Text 10x More Often Than Speech, Even When the Text Is Wrong
Major flaw discovered: when speech and text disagree, AI models tend to follow the text rather than what you say.
A new study finds that speech-enabled LLMs overwhelmingly follow the text when it conflicts with the audio, even when explicitly instructed to listen. In such conflicts, models like Gemini 2.0 Flash follow the text about 10 times more often (16.6% vs 1.6%). This 'text dominance' occurs even though the audio embeddings preserve more accurate information (97.2% accuracy) than the text transcripts (93.9%). The bias is consistent across 8 languages and 4 state-of-the-art models, exposing a critical reliability flaw.
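To make the "10 times" figure concrete, here is a minimal sketch of how such follow rates could be tallied over conflict trials. The summary does not say how the study measured the two percentages, so the function name `follow_rates` and the trial format are hypothetical illustrations, not the study's method. The last line only checks the ratio of the two quoted percentages.

```python
from collections import Counter


def follow_rates(trials):
    """Tally which input the model's answer matched on conflict trials.

    `trials` is a hypothetical format: (model_answer, audio_answer,
    text_answer) tuples in which the audio and text answers differ.
    Returns the fraction of trials matching the text, the audio, or neither.
    """
    counts = Counter()
    total = 0
    for model_answer, audio_answer, text_answer in trials:
        total += 1
        if model_answer == text_answer:
            counts["text"] += 1
        elif model_answer == audio_answer:
            counts["audio"] += 1
        else:
            counts["other"] += 1
    if total == 0:
        raise ValueError("no trials to tally")
    return {key: counts[key] / total for key in ("text", "audio", "other")}


# Ratio of the two percentages quoted in the summary (16.6% vs 1.6%).
reported_ratio = 16.6 / 1.6  # roughly 10.4, i.e. "about 10 times"
```

For example, `follow_rates([("a", "b", "a"), ("b", "b", "a")])` reports that half of the trials matched the text and half matched the audio.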
Why It Matters
The finding undermines trust in voice assistants and other audio AI: when conflicting text is present, they may not be listening to you.