Google's Minerva-Ego benchmark shows AI video reasoning still trails humans

arXiv cs.CV May 18, 2026

⚡ New benchmark shows that knowing where and when to look is key for egocentric video AI models

Deep Dive

A team of Google researchers (including Arsha Nagrani, Shyamal Buch, and Cordelia Schmid) has released Minerva-Ego, a new benchmark designed to push the boundaries of egocentric video understanding. The benchmark builds on recent high-quality egocentric/embodied video datasets by adding challenging, multi-step multimodal questions paired with spatiotemporally-dense human-annotated reasoning traces. Each trace includes mask annotations pinpointing the objects of interest required to answer the question, enabling fine-grained evaluation of intermediate reasoning steps rather than just final answers.
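To make the idea of a spatiotemporally-dense reasoning trace more concrete, here is a minimal sketch of how one annotated question could be represented. The schema is an assumption for illustration only: the class names, field names, and the sample question are hypothetical and are not taken from the benchmark, and object masks are simplified to normalized bounding boxes for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TraceStep:
    """One step of a human-written reasoning trace (hypothetical schema)."""
    text: str        # the reasoning statement for this step
    start_s: float   # "when": start of the relevant interval, in seconds
    end_s: float     # "when": end of the relevant interval, in seconds
    # "where": objects of interest as (label, x0, y0, x1, y1) boxes.
    # Boxes are a simplification; the benchmark annotates masks.
    objects: List[Tuple[str, float, float, float, float]] = field(default_factory=list)


@dataclass
class TracedQuestion:
    """A question, its answer, and the annotated steps that lead to it."""
    question: str
    answer: str
    steps: List[TraceStep]

    def evidence_span(self) -> Tuple[float, float]:
        """Earliest start and latest end across all annotated steps."""
        return (min(s.start_s for s in self.steps),
                max(s.end_s for s in self.steps))


# Made-up example, not a real benchmark item.
example = TracedQuestion(
    question="Which drawer did the person open after picking up the knife?",
    answer="The top drawer",
    steps=[
        TraceStep("The person picks up a knife.", 12.0, 14.5,
                  [("knife", 0.40, 0.55, 0.52, 0.70)]),
        TraceStep("They then open the top drawer.", 15.0, 18.0,
                  [("drawer", 0.10, 0.60, 0.45, 0.80)]),
    ],
)
print(example.evidence_span())  # (12.0, 18.0)
```

Storing a time interval and object regions on every step is what would let an evaluator score intermediate reasoning, not only the final answer.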

Early benchmarking shows that even frontier vision-language models lag significantly behind human performance on these tasks. The researchers also identified a promising technique: giving models explicit hints about 'where' and 'when' to look in the video yields substantial accuracy gains. This suggests that current models struggle to locate the relevant moments and regions on their own, but can use that information when it is supplied. The findings bear directly on building better embodied agents, such as robots and AR systems, that need to understand video from a first-person perspective in real time.
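One way to picture the 'where' and 'when' hinting is as extra text prepended to the question. The sketch below is an assumption about how such a prompt might be assembled, not the researchers' actual method: the function name `build_hinted_prompt`, the hint wording, and the normalized box format are all hypothetical.

```python
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # normalized (x0, y0, x1, y1), assumed format


def build_hinted_prompt(
    question: str,
    when: Optional[List[Tuple[float, float]]] = None,
    where: Optional[List[Tuple[str, Box]]] = None,
) -> str:
    """Prepend optional 'when' and 'where' hints to a question (illustrative only)."""
    lines = []
    if when:
        # Temporal hint: list the time intervals worth attending to.
        spans = ", ".join(f"{a:.1f}s-{b:.1f}s" for a, b in when)
        lines.append(f"Hint (when): focus on {spans}.")
    if where:
        # Spatial hint: name each object of interest and its region.
        regions = "; ".join(
            f"{label} at [{x0:.2f}, {y0:.2f}, {x1:.2f}, {y1:.2f}]"
            for label, (x0, y0, x1, y1) in where
        )
        lines.append(f"Hint (where): look at {regions}.")
    lines.append(f"Question: {question}")
    return "\n".join(lines)


prompt = build_hinted_prompt(
    "Which drawer did the person open after picking up the knife?",
    when=[(12.0, 14.5)],
    where=[("knife", (0.40, 0.55, 0.52, 0.70))],
)
print(prompt)
```

Calling the function with no hints returns only the question line, which gives an unhinted baseline to compare against.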

Key Points

Minerva-Ego includes spatiotemporally-dense human-annotated reasoning traces with object mask annotations
State-of-the-art models still have a large gap to human performance on egocentric video reasoning
Prompting models with hints of 'where' and 'when' to look yields substantial improvements

Why It Matters

For embodied AI agents, knowing 'where and when' to look is crucial for real-world video understanding.
