EVID-Bench reveals AI models fail 57% of video misinformation tests
Even top multimodal models only catch 43% of cleverly edited videos...
A new benchmark from Tao Yu and 19 co-authors, EVID-Bench, shows that frontier multimodal AI models are poorly equipped to detect video misinformation when the deception lies outside the frame. The dataset comprises 222 professionally constructed videos spanning 9 manipulation types across 3 categories: AI-generated content, single-source editing (e.g., temporal reordering, selective cutting), and multi-source splicing. Crucially, every sample was hand-curated so that the manipulation cannot be detected from the video alone, even by the best models: the false narrative depends on missing, reordered, or recontextualized evidence that only a cross-video web search can uncover. The benchmark challenges models not just to flag anomalies but to actively search the open web for related videos and pinpoint how the manipulation distorts reality.
The authors evaluated nine frontier models (including GPT-4o, Gemini Pro, and Claude 3.5) using a retrieval-augmented verification baseline. The best system managed only 61.43% point-level accuracy and 43.24% video-level accuracy, meaning over half of the manipulated videos were not correctly identified as false. AI-generated manipulations were the hardest: models often fixated on irrelevant visual anchors, mistook synthetic artifacts for editorial splicing, or terminated their search before gathering enough evidence to explain the deception. These failures highlight a critical blind spot: current AI can analyze what it sees but cannot yet reason about what it doesn't see, leaving a dangerous gap in the fight against sophisticated misinformation.
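The gap between the two reported scores is easier to see with a small worked example. The sketch below is illustrative only and rests on an assumption the summary does not confirm: that each video is scored on several verification points, and that a video counts as correct only when every one of its points is correct. The function names and the toy data are hypothetical, not part of EVID-Bench.

```python
from typing import Dict, List

# ASSUMPTION: each video has several scored "points", and a video is
# counted as correct only if all of its points are correct. The benchmark's
# actual scoring rule may differ.

def point_level_accuracy(results: Dict[str, List[bool]]) -> float:
    """Fraction of individual points judged correctly, pooled over all videos."""
    total = sum(len(points) for points in results.values())
    correct = sum(sum(points) for points in results.values())
    return correct / total

def video_level_accuracy(results: Dict[str, List[bool]]) -> float:
    """Fraction of videos for which every point was judged correctly."""
    fully_correct = sum(all(points) for points in results.values())
    return fully_correct / len(results)

# Hypothetical toy results, not real benchmark data.
toy = {
    "video_a": [True, True, True],
    "video_b": [True, False, True],
    "video_c": [True, True, False],
    "video_d": [False, False],
}

print(f"point-level: {point_level_accuracy(toy):.2%}")
print(f"video-level: {video_level_accuracy(toy):.2%}")

# Arithmetic on the reported figures: 43.24% of 222 videos, assuming the
# video-level score is computed over the full dataset.
implied_correct = round(0.4324 * 222)
implied_missed = 222 - implied_correct
print(implied_correct, implied_missed)
```

Under this assumed rule, a single wrong point sinks the whole video, which is one way a system could score noticeably higher at the point level than at the video level.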
- EVID-Bench includes 222 videos across 9 manipulation types in 3 categories: AI generation, single-source editing, and multi-source editing.
- Even the best-performing system with retrieval-augmented verification achieved only 43.24% video-level accuracy, so it misjudged more than half of the manipulated videos.
- Common error modes: models fixate on irrelevant anchors, misattribute synthetic content to editing, and halt searches prematurely without fully explaining the manipulation.
Why It Matters
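As deepfakes and manipulated videos proliferate, current AI defenses remain dangerously inadequate: even with retrieval-augmented web search, the best system misjudges more than half of the videos whose deception can only be exposed by cross-video verification.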