Beyond Closed-Pool Video Retrieval: A Benchmark and Agent Framework for Real-World Video Search and Moment Localization
A new benchmark shows that current video AI models struggle to find videos and moments from vague, memory-like descriptions.
Researchers introduced RVMS-Bench, a 1,440-sample benchmark for real-world video search in which queries are fuzzy, multi-dimensional memories rather than precise descriptions. They also proposed RACLO, an agentic framework that uses abductive reasoning to mimic the human "Recall-Search-Verify" process. In their experiments, existing multimodal large language models (MLLMs) performed poorly at retrieving videos and localizing specific moments from vague, real-world memory cues, which points to a significant gap in current AI capabilities.
Why It Matters
The benchmark exposes a critical gap in today's video AI: models cope poorly with the imperfect way people actually remember videos. That pushes development toward systems that can search from human-like, partial recall.