[R] VLM Behavior on Long Video Understanding
Vision-language models achieve 100% accuracy on multiple-choice questions about long videos but 0% when the same questions are posed open-ended.
A viral analysis by a researcher on Reddit has exposed a significant flaw in current Vision-Language Models (VLMs) such as GPT-4V and Claude 3. The study tested models on established long-video understanding benchmarks—including Video-MME, MLVU, and LongVideoBench—which feature complex content from films, TV shows, and documentaries. The core finding is stark: when asked open-ended, multi-step reasoning questions (e.g., "What happened next and why?"), the VLMs consistently failed to provide correct answers, scoring 0% accuracy. This suggests the models lack genuine comprehension of temporal narratives and causal relationships in extended visual sequences.
However, performance flipped dramatically when the same questions were reformatted as multiple-choice queries with four options. In this constrained setting, the VLMs achieved a perfect 100% accuracy rate. The discrepancy indicates that current models are exceptionally adept at matching patterns and selecting among provided candidates, but struggle with generative reasoning and constructing answers from scratch. The analysis argues that benchmarks relying heavily on multiple-choice formats may overestimate AI capabilities, masking a critical deficiency in true, open-ended video understanding needed for real-world applications like content analysis or autonomous agent instruction.
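The gap between the two settings comes down to how the formats are scored. A minimal sketch of such an evaluation split, where `to_multiple_choice`, `score_multiple_choice`, and `score_open_ended` are hypothetical helper names and a crude token-overlap check stands in for the human or LLM grading a real benchmark would use:

```python
# Sketch of the two evaluation regimes described above (illustrative only).

def to_multiple_choice(question: str, options: list[str]) -> str:
    """Reformat an open-ended question into a four-option prompt."""
    lines = [question]
    lines += [f"{letter}. {opt}" for letter, opt in zip("ABCD", options)]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)

def score_multiple_choice(reply: str, correct_letter: str) -> bool:
    # Constrained setting: exact match on the chosen letter is enough.
    return reply.strip().upper().startswith(correct_letter.upper())

def score_open_ended(reply: str, reference: str, threshold: float = 0.5) -> bool:
    # Open-ended setting: crude token-overlap proxy for semantic agreement;
    # a real benchmark would use a human or LLM judge here.
    ref = set(reference.lower().split())
    hit = set(reply.lower().split()) & ref
    return bool(ref) and len(hit) / len(ref) >= threshold

# Example question in the spirit of the benchmarks mentioned above.
question = "What happened after the character left the house, and why?"
options = ["He drove to work", "He called a friend",
           "He returned for his keys", "He went for a run"]
prompt = to_multiple_choice(question, options)
```

Under this split, a model that reliably picks the right letter passes `score_multiple_choice` even if its freely generated answer to the same question would fail the open-ended grader — the asymmetry the Reddit analysis claims to observe.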
- VLMs scored 0% accuracy on open-ended, multi-step reasoning questions about long videos from datasets like Video-MME.
- The same models achieved 100% accuracy when questions were converted to a multiple-choice format with four options.
- The finding suggests a major gap in true narrative understanding versus superficial pattern-matching in current vision-language AI.
Why It Matters
This reveals a critical overestimation of AI's video reasoning skills, impacting development of reliable agents for surveillance, content moderation, and autonomous systems.