Audio & Speech

SVHalluc benchmark shows AV-LLMs struggle with speech-vision alignment

Open-source AV-LLMs score near-random at aligning speech with visual signals.

Deep Dive

A new benchmark called SVHalluc, accepted at CVPR 2026 and created by Chenshuang Zhang and colleagues, systematically measures how well audio-visual large language models (AV-LLMs) align spoken language with visual content. While existing benchmarks primarily test environmental sound recognition (e.g., a dog barking), SVHalluc targets the richer semantics and temporal structure of human speech. It evaluates models on two key axes: semantic alignment (does the speech content match what's shown?) and temporal alignment (does the timing of speech match visual events?).

Experimental results show that open-source AV-LLMs achieve near-random accuracy on multiple SVHalluc tasks, indicating a fundamental inability to ground speech in video. In contrast, Gemini 2.5 Pro significantly outperforms these models. The authors attribute the failures to limited cross-modality understanding: models handle a single modality (audio or vision alone) well but fail to integrate the two when the speech carries specific meaning. The work exposes a critical blind spot in current AV-LLMs and underscores the need for training that grounds video comprehension in speech.
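To make "near-random" concrete, the sketch below checks whether a model's accuracy is statistically distinguishable from guessing. It is an illustration only, not the authors' evaluation code: the function names, the 0.05 threshold, and the example counts are hypothetical, and it assumes multiple-choice questions with a fixed number of answer options, which this summary does not confirm for SVHalluc.

```python
from math import comb


def chance_accuracy(num_options: int) -> float:
    """Expected accuracy of uniform random guessing over num_options answers."""
    return 1.0 / num_options


def p_value_above_chance(correct: int, total: int, chance: float) -> float:
    """Exact one-sided binomial tail: P(X >= correct) if every answer is a guess."""
    return sum(
        comb(total, k) * chance**k * (1.0 - chance) ** (total - k)
        for k in range(correct, total + 1)
    )


def is_near_random(correct: int, total: int, num_options: int,
                   alpha: float = 0.05) -> bool:
    """True when the score cannot be distinguished from guessing at level alpha."""
    p = p_value_above_chance(correct, total, chance_accuracy(num_options))
    return p >= alpha


# Hypothetical counts, assuming yes/no questions (two options, chance = 50%).
print(is_near_random(correct=52, total=100, num_options=2))
print(is_near_random(correct=90, total=100, num_options=2))
```

Under these assumptions, a model that answers 52 of 100 yes/no questions correctly is indistinguishable from guessing, while one that answers 90 of 100 correctly is clearly above chance.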

Key Points
  • SVHalluc is the first benchmark to evaluate speech-vision hallucination, testing semantic and temporal alignment of speech with video.
  • Open-source AV-LLMs (e.g., Video-LLaMA) achieve near-random accuracy on multiple tasks.
  • Gemini 2.5 Pro significantly outperforms open-source models, but all models show weakness in cross-modality understanding despite strong single-modality performance.

Why It Matters

AV-LLMs deployed on real-world video need to tie what is said to what is shown. Current open-source models score near chance at this, which limits their usefulness for video understanding and interactive applications.