Agent Frameworks

ARGOS: Who, Where, and When in Agentic Multi-Camera Person Search

New benchmark with 2,691 tasks challenges AI to find people using vague clues and limited questions.

Deep Dive

A team from KAIST has introduced ARGOS, the first benchmark and framework that transforms multi-camera person search into an interactive, agentic reasoning problem. Instead of a simple image-matching task, an ARGOS agent is given a vague witness description and must strategically decide what to ask, when to invoke spatial or temporal analysis tools, and how to interpret ambiguous answers—all within a strict turn budget. Reasoning is grounded in a novel Spatio-Temporal Topology Graph (STTG) that encodes real-world camera connectivity and empirically validated transition times, forcing the AI to think like a human investigator piecing together clues.
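The STTG described above is, at its core, a graph whose nodes are cameras and whose edges carry empirically validated transition-time windows. The following is a minimal sketch of that idea; the class name, camera names, and time windows are all illustrative assumptions, not the paper's actual data format or API.

```python
# Illustrative sketch of a spatio-temporal topology graph (STTG).
# All names and numbers here are hypothetical examples.

class STTG:
    def __init__(self):
        # adjacency: camera -> {neighbor: (min_s, max_s) transit window}
        self.edges = {}

    def add_transition(self, cam_a, cam_b, min_s, max_s):
        self.edges.setdefault(cam_a, {})[cam_b] = (min_s, max_s)

    def is_plausible(self, cam_a, t_a, cam_b, t_b):
        """Could a person seen at cam_a at time t_a reappear at cam_b at t_b?"""
        window = self.edges.get(cam_a, {}).get(cam_b)
        if window is None:
            return False  # cameras are not directly connected
        lo, hi = window
        return lo <= (t_b - t_a) <= hi

g = STTG()
g.add_transition("lobby_cam", "hall_cam", 20, 90)  # 20-90 s walking transit
print(g.is_plausible("lobby_cam", 100, "hall_cam", 150))  # True: 50 s transit
print(g.is_plausible("lobby_cam", 100, "hall_cam", 400))  # False: too slow
```

Reasoning over such a graph is what lets an agent prune candidates: a sighting whose transit time falls outside the validated window can be ruled out without further questions.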

The benchmark itself is substantial, comprising 2,691 distinct tasks spread across 14 real-world scenarios and organized into three progressive tracks: semantic perception (Who), spatial reasoning (Where), and temporal reasoning (When). This structure tests an AI's ability to handle increasingly complex dimensions of the search problem. Initial experiments using four different large language model (LLM) backbones reveal the challenge's difficulty, with the best models achieving a Task Weighted Score (TWS) of just 0.383 on the spatial track and 0.590 on the temporal track. Crucially, ablation studies confirm the necessity of the specialized tools, showing that removing them causes a dramatic accuracy drop of up to 49.6 percentage points, proving that raw LLM capability is insufficient for this grounded, multi-step reasoning.

Accepted to the CVPR 2026 Workshop on Multimodal Spatial Intelligence, ARGOS establishes a new, rigorous testbed for evaluating AI agents in a realistic investigative context. It moves beyond passive perception, demanding active planning, strategic information gathering, and robust reasoning under uncertainty—core skills for the next generation of autonomous AI systems intended to operate in complex physical environments like airports, campuses, or smart cities.

Key Points
  • Reformulates person search as an agentic task where AI must plan questions and use tools under a limited turn budget, based on a Spatio-Temporal Topology Graph (STTG).
  • Comprehensive benchmark with 2,691 tasks across 14 real-world scenarios, testing 'Who', 'Where', and 'When' reasoning progressively.
  • Current LLMs perform poorly (best TWS of 0.383 on the spatial track), and accuracy plummets by up to 49.6 percentage points without the domain-specific spatial/temporal tools.
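The turn-budget interaction pattern in the first point can be sketched as a simple control loop: each turn the agent either asks the witness a question, calls a tool, or commits to an answer, and the budget forces an answer when turns run out. Everything below, including the function names and the toy policy, is an illustrative assumption rather than the ARGOS framework's actual interface.

```python
# Minimal sketch of an agent loop under a strict turn budget.
# ask_witness, call_tool, and decide are caller-supplied callables;
# their names and signatures are hypothetical.

def run_agent(ask_witness, call_tool, decide, budget=5):
    evidence = []
    for turn in range(budget):
        action, payload = decide(evidence, turns_left=budget - turn)
        if action == "ask":
            evidence.append(("witness", ask_witness(payload)))
        elif action == "tool":
            evidence.append(("tool", call_tool(*payload)))
        else:  # action == "answer": commit before the budget runs out
            return payload
    # Budget exhausted: force a final answer from the evidence gathered.
    return decide(evidence, turns_left=0)[1]

# Toy policy: ask two clarifying questions, then commit.
def decide(evidence, turns_left):
    if turns_left > 0 and len(evidence) < 2:
        return ("ask", "What color was the jacket?")
    return ("answer", "camera_3")

print(run_agent(lambda q: "red", lambda *a: None, decide))  # camera_3
```

The benchmark's difficulty comes precisely from this trade-off: every question or tool call consumes a turn, so the agent must weigh the information value of each action against the shrinking budget.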

Why It Matters

Sets a new standard for testing AI's real-world reasoning and planning skills, crucial for security, retail analytics, and autonomous surveillance systems.