AI Safety

Auditing the Reliability of Multimodal Generative Search

Study of 11,943 claim-video pairs reveals up to 18.7% of Gemini's video citations don't support its claims.

Deep Dive

A new research paper from Erfan Samieyan Sahneh and Luca Maria Aiello presents a large-scale audit of Google's Gemini 2.5 Pro multimodal search system, which retrieves and synthesizes answers from multimedia content such as YouTube videos. The study analyzed 11,943 claim-video pairs generated across Medical, Economic, and General domains. Using automated verification with three independent LLM judges (which achieved 87.7% inter-rater agreement and were validated against human annotations), the researchers found that between 3.7% and 18.7% of video-grounded claims were not substantiated by their cited sources. Failures were rarely outright contradictions; the dominant modes were instead unverifiable specifics and overstated claims.
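
To make the verification scheme concrete, here is a minimal sketch of majority voting across independent judges plus a simple pairwise agreement rate. The judge verdicts below are mock data, and the paper's actual prompts, judge models, and agreement metric may differ:

```python
# Minimal sketch: majority-vote verification across N LLM judges and a
# pairwise inter-rater agreement rate. Verdicts here are mock strings.
from collections import Counter
from itertools import combinations

def majority_verdict(verdicts: list[str]) -> str:
    """Return the label chosen by the most judges, e.g. 'supported'."""
    return Counter(verdicts).most_common(1)[0][0]

def pairwise_agreement(all_verdicts: list[list[str]]) -> float:
    """Fraction of judge pairs that agree, pooled over all claim-video pairs."""
    agree, total = 0, 0
    for verdicts in all_verdicts:
        for a, b in combinations(verdicts, 2):
            agree += a == b
            total += 1
    return agree / total

# Example: three judges scoring two claim-video pairs.
judgements = [
    ["supported", "supported", "unsupported"],
    ["unsupported", "unsupported", "unsupported"],
]
print([majority_verdict(v) for v in judgements])  # ['supported', 'unsupported']
print(pairwise_agreement(judgements))             # 0.666...
```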

This suggests the system often injects precise but ungrounded details from its parametric knowledge while citing videos as evidence, creating a misleading veneer of authority. Exploratory post-hoc logistic regression identified claim properties associated with these failures: claims whose vocabulary departs markedly from the source (β = -1.6 to -3.1, p < 0.01) and claims with low semantic similarity to the video transcript (β = -2.1 to -11.6, p < 0.01) are significantly more likely to be unsupported. The findings characterize a critical gap between the confidence these generative search systems project and the actual fidelity of their outputs, highlighting a fundamental trustworthiness issue for users who rely on them for information.
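
For illustration, the sketch below fits a logistic regression of a binary supported/unsupported label on two claim-level features. The feature definitions (token overlap and embedding similarity) and the synthetic data are assumptions for demonstration, not the paper's exact setup:

```python
# Illustrative sketch of the post-hoc analysis: regress a binary
# "claim supported?" label on claim-transcript features. All data and
# feature definitions here are synthetic placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500

# Hypothetical features per claim-video pair:
#   vocab_overlap: share of claim tokens also present in the transcript
#   semantic_sim:  cosine similarity of claim and transcript embeddings
vocab_overlap = rng.uniform(0, 1, n)
semantic_sim = rng.uniform(0, 1, n)
X = np.column_stack([vocab_overlap, semantic_sim])

# Synthetic labels: claims far from the source text are more often unsupported.
logits = -2.0 + 2.5 * vocab_overlap + 3.0 * semantic_sim
supported = rng.random(n) < 1 / (1 + np.exp(-logits))

model = LogisticRegression().fit(X, supported)
# Positive coefficients mean higher overlap/similarity predicts "supported";
# equivalently, low-similarity claims are more likely to be unsupported.
print(dict(zip(["vocab_overlap", "semantic_sim"], model.coef_[0])))
```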

Key Points
  • Audit of 11,943 Gemini 2.5 Pro claim-video pairs found 3.7% to 18.7% of citations unsupported.
  • Primary failure was injecting ungrounded, precise details from parametric knowledge while citing videos.
  • Claims with low semantic similarity to video transcripts were significantly more likely to be unsupported (β = -2.1 to -11.6).

Why It Matters

This exposes a critical trust gap in AI search tools that project authority by citing sources that don't fully support their claims.