Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding
New benchmark, built with 3,300 human-hours of annotation, reveals that Gemini-3-Pro still falls far short of human experts.
A consortium of 19 researchers has launched Video-MME-v2, a next-generation benchmark designed to rigorously test the true capabilities of video understanding models. The project addresses a critical industry problem: existing benchmarks have become saturated, with inflated leaderboard scores masking significant gaps in real-world performance. To combat this, the team developed a progressive tri-level hierarchy that systematically increases task complexity, moving from basic visual information aggregation to temporal dynamics modeling and finally to complex multimodal reasoning. This structure helps pinpoint exactly where models fail.
Beyond harder questions, Video-MME-v2 introduces a novel "group-based non-linear evaluation" strategy. Rather than scoring each question in isolation, this method assesses consistency across related queries and coherence in multi-step reasoning, penalizing lucky guesses and rewarding answers backed by valid logical support. The dataset's quality is underpinned by an intensive human annotation pipeline involving 12 annotators and 50 reviewers, totaling 3,300 human-hours and up to 5 rounds of quality assurance.

Initial results are stark: even the current best model, Google's Gemini-3-Pro, shows a substantial performance gap compared to human experts. The benchmark uncovers a clear hierarchical bottleneck in which errors on low-level visual and temporal tasks propagate upward and cripple high-level reasoning. It also reveals that "thinking-based" reasoning in models is often overly dependent on textual cues: performance improves when subtitles are available but falters in purely visual contexts. By exposing these specific limitations, Video-MME-v2 aims to steer the development of more robust and capable next-generation video MLLMs (Multimodal Large Language Models).
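The exact group-based scoring formula is not spelled out here, but the intuition can be illustrated with a minimal sketch. Assume, hypothetically, that each question belongs to a group of related queries and that a group earns credit only when every question in it is answered correctly; the `group_score` function and its input format below are illustrative assumptions, not the benchmark's actual implementation.

```python
from collections import defaultdict

def group_score(results):
    """Hypothetical group-based, non-linear scoring rule.

    `results` is a list of (group_id, is_correct) pairs -- an assumed
    input format, not Video-MME-v2's real schema. A group is credited
    only when all of its related questions are answered correctly, so
    an isolated lucky guess earns no credit.
    """
    groups = defaultdict(list)
    for group_id, is_correct in results:
        groups[group_id].append(is_correct)
    # Non-linear step: credit is all-or-nothing per group of related questions.
    credited = sum(1 for answers in groups.values() if all(answers))
    return credited / len(groups) if groups else 0.0

# Example: group "g1" has one wrong answer, so only "g2" counts -> score 0.5.
print(group_score([("g1", True), ("g1", False), ("g2", True), ("g2", True)]))
```

Under a rule like this, a model that guesses one sub-question correctly while contradicting itself on the related ones receives no credit for that group, which matches the consistency requirement the benchmark describes.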
- Introduces a progressive tri-level hierarchy testing visual aggregation, temporal modeling, and complex reasoning.
- Uses a novel group-based non-linear evaluation that penalizes guesswork, requiring consistent, coherent reasoning.
- Reveals a major performance gap; Gemini-3-Pro lags far behind human experts, with errors cascading from low-level to high-level tasks.
Why It Matters
Sets a new, more rigorous standard for evaluating AI video understanding, pushing models beyond superficial benchmarks toward genuine comprehension.