Google's Minerva-Ego benchmark shows AI video reasoning still trails humans
New benchmark shows that knowing where and when to look is key for egocentric video AI models
A team of Google researchers (including Arsha Nagrani, Shyamal Buch, and Cordelia Schmid) has released Minerva-Ego, a new benchmark designed to push the boundaries of egocentric video understanding. The benchmark builds on recent high-quality egocentric/embodied video datasets by adding challenging, multi-step multimodal questions paired with spatiotemporally-dense human-annotated reasoning traces. Each trace includes mask annotations pinpointing the objects of interest required to answer the question, enabling fine-grained evaluation of intermediate reasoning steps rather than just final answers.
Early benchmarking shows that even frontier vision-language models lag significantly behind human performance on these tasks. But the researchers identified a promising technique: providing models with explicit hints about 'where' and 'when' to look in the video yields substantial accuracy gains. This suggests that current models struggle to locate the relevant moments and regions on their own but can benefit from structured guidance. The findings have direct implications for building better embodied agents, such as robots and AR systems that need to understand video from a first-person perspective in real time.
- Minerva-Ego includes spatiotemporally-dense human-annotated reasoning traces with object mask annotations
- State-of-the-art models still have a large gap to human performance on egocentric video reasoning
- Prompting models with hints of 'where' and 'when' to look yields substantial improvements
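To make the 'where' and 'when' hinting idea concrete, here is a minimal Python sketch of how such hints might be folded into a text prompt. It is an illustration only, not the researchers' method or prompt format: the `LookHint` class, the `format_hint` and `build_prompt` functions, and the wording of the prompt are all hypothetical, and a normalized bounding box stands in for the object masks the benchmark actually annotates.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class LookHint:
    """Hypothetical hint: a time window ('when') plus an optional region ('where')."""
    start_s: float
    end_s: float
    # Assumption: a normalized (x0, y0, x1, y1) box approximates an object mask.
    box_xyxy: Optional[Tuple[float, float, float, float]] = None
    label: str = ""


def format_hint(hint: LookHint) -> str:
    """Render one hint as a plain-English instruction."""
    when = f"between {hint.start_s:.1f}s and {hint.end_s:.1f}s"
    if hint.box_xyxy is None:
        return f"Look {when}."
    x0, y0, x1, y1 = hint.box_xyxy
    where = f"region x={x0:.2f}-{x1:.2f}, y={y0:.2f}-{y1:.2f}"
    target = f" ({hint.label})" if hint.label else ""
    return f"Look {when} at the {where}{target}."


def build_prompt(question: str, hints: list[LookHint]) -> str:
    """Combine a question with hints, ordered by start time."""
    lines = [f"Question: {question}"]
    if hints:
        lines.append("Hints:")
        ordered = sorted(hints, key=lambda h: h.start_s)
        lines.extend(f"- {format_hint(h)}" for h in ordered)
    lines.append("Answer:")
    return "\n".join(lines)


if __name__ == "__main__":
    hint = LookHint(12.0, 15.5, (0.1, 0.4, 0.3, 0.6), "drawer")
    print(build_prompt("Which drawer held the knife?", [hint]))
```

With no hints, the same function returns the bare question, so a hinted and an unhinted run can be compared on identical questions.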
Why It Matters
For embodied AI agents, knowing 'where and when' to look is crucial for real-world video understanding.