VideoOdyssey benchmark tests AI on 109-minute videos with continuous reasoning
New benchmark forces AI to track details across 1.8-hour videos, revealing critical gaps.
Real-world long-video understanding demands continuous tracking, information integration, and memory retention over extreme durations, a cognitive load most benchmarks fail to capture. Existing tests often use short, isolated clips and so miss the challenge of ultra-long-context reasoning. To address this, researchers introduce VideoOdyssey, built around a core metric: continuous certificate length, the duration of the continuous video segment a human must watch to answer a question definitively.
VideoOdyssey includes two subsets: VideoOdyssey-V for pure visual understanding and VideoOdyssey-AV for synchronized audio-visual comprehension. Videos average 109 minutes and span documentaries, tutorials, sports, and more, requiring models to reason at five levels of temporal granularity, from seconds to hours. Extensive evaluations of current multimodal LLMs show that they struggle not only with simple retrieval but also with continuous reasoning across varying contexts, fine-grained perception, and non-verbal omni-modal understanding, pointing to fundamental bottlenecks in long-video AI.
- Average video duration of 109 minutes across 11 domains and 54 subcategories.
- Continuous certificate length: 16 minutes for visual (VideoOdyssey-V) and 12.8 minutes for audio-visual (VideoOdyssey-AV).
- 5 granular evaluation levels from seconds to hours, testing models on continuous reasoning and fine-grained perception.
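To make the certificate-length metric concrete, here is a minimal Python sketch. It assumes each question is annotated with the start and end of the continuous segment a viewer must watch; the `Annotation` schema and the function names are hypothetical illustrations, not the benchmark's actual data format or tooling.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    """One question plus the continuous span needed to answer it (hypothetical schema)."""
    video_minutes: float   # total length of the source video
    span_start_min: float  # where the required segment begins
    span_end_min: float    # where the required segment ends

    @property
    def certificate_minutes(self) -> float:
        # Continuous certificate length: duration of the required segment.
        return self.span_end_min - self.span_start_min


def mean_certificate_minutes(annotations: list[Annotation]) -> float:
    """Average certificate length over a set of annotated questions."""
    return sum(a.certificate_minutes for a in annotations) / len(annotations)


def certificate_share(certificate_minutes: float, video_minutes: float) -> float:
    """Fraction of the video covered by the required segment."""
    return certificate_minutes / video_minutes


# Made-up example annotations, for illustration only.
examples = [
    Annotation(video_minutes=109.0, span_start_min=10.0, span_end_min=30.0),
    Annotation(video_minutes=109.0, span_start_min=50.0, span_end_min=62.0),
]
print(mean_certificate_minutes(examples))
print(round(certificate_share(16.0, 109.0), 3))
print(round(certificate_share(12.8, 109.0), 3))
```

If the reported figures are read as averages, a 16-minute certificate is roughly 14.7% of a 109-minute video, and a 12.8-minute certificate is roughly 11.7%.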
Why It Matters
Long-video understanding is critical for surveillance, media analysis, and autonomous systems, and the benchmark's evaluations show that current models fall far short.