AI Safety

Auditing the Reliability of Multimodal Generative Search

Study of 11,943 claim-video pairs reveals up to 18.7% of Gemini's video citations don't support its claims.

Deep Dive

A new research paper from Erfan Samieyan Sahneh and Luca Maria Aiello presents a large-scale audit of Google's Gemini 2.5 Pro multimodal search system, which retrieves and synthesizes answers from multimedia content such as YouTube videos. The study analyzed 11,943 claim-video pairs generated across Medical, Economic, and General domains. Using automated verification with three independent LLM judges (which achieved 87.7% inter-rater agreement and were validated against human annotations), the researchers found that between 3.7% and 18.7% of video-grounded claims were not substantiated by their cited sources. Failures were rarely outright contradictions; the dominant modes were instead unverifiable specifics and overstated claims.
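
To make the verification scheme concrete, here is a minimal sketch of majority voting across independent judges plus a simple pairwise agreement rate. The judge verdicts below are mock data, and the paper's actual prompts, judge models, and agreement metric may differ:

```python
# Minimal sketch: majority-vote verification across N LLM judges and a
# pairwise inter-rater agreement rate. Verdicts here are mock strings.
from collections import Counter
from itertools import combinations

def majority_verdict(verdicts: list[str]) -> str:
    """Return the label chosen by the most judges, e.g. 'supported'."""
    return Counter(verdicts).most_common(1)[0][0]

def pairwise_agreement(all_verdicts: list[list[str]]) -> float:
    """Fraction of judge pairs that agree, pooled over all claim-video pairs."""
    agree, total = 0, 0
    for verdicts in all_verdicts:
        for a, b in combinations(verdicts, 2):
            agree += a == b
            total += 1
    return agree / total

# Example: three judges scoring two claim-video pairs.
judgements = [
    ["supported", "supported", "unsupported"],
    ["unsupported", "unsupported", "unsupported"],
]
print([majority_verdict(v) for v in judgements])  # ['supported', 'unsupported']
print(pairwise_agreement(judgements))             # 0.666...
```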

This suggests the system often injects precise but ungrounded details from its parametric knowledge while citing videos as evidence, creating a misleading veneer of authority. Exploratory post-hoc logistic regression identified claim properties associated with these failures: claims whose vocabulary departs markedly from the source (β = -1.6 to -3.1, p < 0.01) and claims with low semantic similarity to the video transcript (β = -2.1 to -11.6, p < 0.01) are significantly more likely to be unsupported. The findings characterize a critical gap between the confidence these generative search systems project and the actual fidelity of their outputs, highlighting a fundamental trustworthiness issue for users who rely on them for information.
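
For illustration, the sketch below fits a logistic regression of a binary supported/unsupported label on two claim-level features. The feature definitions (token overlap and embedding similarity) and the synthetic data are assumptions for demonstration, not the paper's exact setup:

```python
# Illustrative sketch of the post-hoc analysis: regress a binary
# "claim supported?" label on claim-transcript features. All data and
# feature definitions here are synthetic placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500

# Hypothetical features per claim-video pair:
#   vocab_overlap: share of claim tokens also present in the transcript
#   semantic_sim:  cosine similarity of claim and transcript embeddings
vocab_overlap = rng.uniform(0, 1, n)
semantic_sim = rng.uniform(0, 1, n)
X = np.column_stack([vocab_overlap, semantic_sim])

# Synthetic labels: claims far from the source text are more often unsupported.
logits = -2.0 + 2.5 * vocab_overlap + 3.0 * semantic_sim
supported = rng.random(n) < 1 / (1 + np.exp(-logits))

model = LogisticRegression().fit(X, supported)
# Positive coefficients mean higher overlap/similarity predicts "supported";
# equivalently, low-similarity claims are more likely to be unsupported.
print(dict(zip(["vocab_overlap", "semantic_sim"], model.coef_[0])))
```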

Key Points
  • Audit of 11,943 Gemini 2.5 Pro claim-video pairs found 3.7% to 18.7% of citations unsupported.
  • Primary failure was injecting ungrounded, precise details from parametric knowledge while citing videos.
  • Claims with low semantic similarity to video transcripts were significantly more likely to be unsupported (β = -2.1 to -11.6).

Why It Matters

This exposes a critical trust gap in AI search tools that project authority by citing sources that don't fully support their claims.