VideoSEAL: New AI framework fixes evidence misalignment in long video understanding
AI agents get right answers for wrong reasons—VideoSEAL decouples planning from verification.
Long video question answering is difficult because the relevant visual evidence is sparse and scattered across time. Current multimodal large language models (MLLMs) excel at short clips but falter on longer videos, and the multi-turn agentic interactions often used to compensate can produce correct answers for the wrong reasons, a phenomenon the researchers call 'evidence misalignment.' Using two new diagnostics, temporal groundedness and semantic groundedness, the team found that this misalignment is exacerbated by two forces: prompt pressure from saturated shared contexts at inference time, and reward pressure from outcome-only optimization during training.
To address this, VideoSEAL introduces a decoupled planner-inspector framework. The planner handles long-horizon search and retrieval, while a separate inspector performs pixel-level verification before the final answer is generated. This separation keeps planning authority from being conflated with answer authority. On four long-video benchmarks, the framework improves accuracy and also produces interpretable search trajectories, achieving 55.1% on LVBench and 62.0% on LongVideoBench. A key practical benefit is that the decoupled architecture scales consistently with larger search budgets and supports plug-and-play upgrades of the MLLM backbone without retraining the planner. The paper was accepted to ICML 2026, and the code and models are open-sourced.
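To make the planner-inspector split concrete, here is a minimal Python sketch of such a loop. It is an illustration under stated assumptions, not VideoSEAL's released code: the names run_episode, toy_planner, toy_inspector and Verdict are hypothetical, and the toy components stand in for real models. The point it shows is structural: the planner only chooses where to look, and only the inspector, after checking a segment, is allowed to supply the answer.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Segment = Tuple[float, float]          # (start_sec, end_sec), hypothetical format
Step = Tuple[Segment, bool]            # segment inspected, whether it was supported


@dataclass
class Verdict:
    """Result of inspecting one segment (hypothetical structure)."""
    supported: bool                    # did the segment contain the evidence?
    answer: Optional[str]              # answer grounded in that segment, if any


def run_episode(
    propose: Callable[[str, List[Step]], Optional[Segment]],
    inspect: Callable[[str, Segment], Verdict],
    question: str,
    budget: int,
) -> Tuple[Optional[str], List[Step]]:
    """Alternate planning and inspection until evidence is verified or budget ends."""
    trajectory: List[Step] = []
    for _ in range(budget):
        # The planner sees only the question and the search history.
        segment = propose(question, trajectory)
        if segment is None:
            break
        # The inspector is the only component that looks at the segment itself.
        verdict = inspect(question, segment)
        trajectory.append((segment, verdict.supported))
        if verdict.supported:
            # The answer comes from the inspector, never from the planner.
            return verdict.answer, trajectory
    return None, trajectory


def toy_planner(question, trajectory, window=60.0, duration=300.0):
    """Assumed stand-in planner: scan a 300 s video in 60 s windows."""
    start = len(trajectory) * window
    if start >= duration:
        return None
    return (start, min(start + window, duration))


def toy_inspector(question, segment):
    """Assumed stand-in inspector: evidence is hard-coded at t = 130 s."""
    evidence_time = 130.0
    start, end = segment
    if start <= evidence_time < end:
        return Verdict(True, "red car")
    return Verdict(False, None)


answer, trajectory = run_episode(
    toy_planner, toy_inspector, "What color is the car?", budget=5
)
print(answer)
print(trajectory)
```

Because the inspector is passed in as a plain callable, it can be swapped for a different one without touching the planner, which mirrors the plug-and-play property described above. The returned trajectory is the list of inspected segments with their verification results, a simple analogue of an interpretable search trajectory. Lowering budget to 2 in this toy setup ends the search before the evidence window is reached, so no answer is returned.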
- VideoSEAL introduces a decoupled planner-inspector framework to fix 'evidence misalignment' in long video AI agents
- Achieves 55.1% on LVBench and 62.0% on LongVideoBench with interpretable search trajectories
- Supports plug-and-play MLLM upgrades without retraining the planner; open-sourced on GitHub
Why It Matters
Makes AI video understanding more trustworthy by ensuring answers are backed by actual visual evidence.