Benchmarking Interaction, Beyond Policy: a Reproducible Benchmark for Collaborative Instance Object Navigation
New benchmark scores navigation and question-asking separately; its companion model is 3x smaller and 70x faster than prior approaches.
A multi-institution research team has introduced QAsk-Nav, a benchmark designed to rigorously evaluate AI agents on Collaborative Instance Object Navigation (CoIN) tasks. In CoIN, an embodied agent must navigate to a target object described in natural language. It sees only a first-person view, and when the scene is visually ambiguous it must ask a human clarifying questions. The key innovation of QAsk-Nav is that it decouples the evaluation of navigation from the evaluation of collaborative dialogue, so researchers can measure an agent's ability to move separately from its ability to ask useful questions. The benchmark includes a lightweight question-asking protocol and an enhanced navigation protocol with diverse, high-quality object descriptions.
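As a rough illustration of what decoupled scoring can look like, the Python sketch below reports one number for navigation and a separate number for question-asking instead of a single blended score. Everything in it is an assumption made for illustration: the record types `NavEpisode` and `QuestionEpisode`, their fields, and both metrics are hypothetical and are not the protocols or metrics defined by QAsk-Nav.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class NavEpisode:
    """One navigation-only episode (hypothetical record format)."""
    reached_target: bool


@dataclass
class QuestionEpisode:
    """One question-asking episode (hypothetical record format)."""
    candidates_before: int  # plausible target instances before the question
    candidates_after: int   # plausible instances left after the human's answer


def navigation_success_rate(episodes):
    """Fraction of navigation episodes that ended at the target."""
    return mean(1.0 if e.reached_target else 0.0 for e in episodes)


def question_usefulness_rate(episodes):
    """Fraction of episodes where the question narrowed the candidate set."""
    return mean(
        1.0 if e.candidates_after < e.candidates_before else 0.0
        for e in episodes
    )


def evaluate(nav_episodes, question_episodes):
    """Score the two skills independently and report them side by side."""
    return {
        "navigation_success": navigation_success_rate(nav_episodes),
        "question_usefulness": question_usefulness_rate(question_episodes),
    }


if __name__ == "__main__":
    nav = [NavEpisode(True), NavEpisode(True), NavEpisode(False), NavEpisode(True)]
    asks = [QuestionEpisode(3, 1), QuestionEpisode(2, 2)]
    print(evaluate(nav, asks))
```

Because the two scores are computed from separate episode sets, a weak navigator cannot hide a strong questioner, or the reverse, which is the general idea behind evaluating the skills apart.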
To demonstrate the benchmark's utility, the researchers also built Light-CoNav, a new model for collaborative navigation. This unified model is more efficient than previous modular approaches: it is 3x smaller and 70x faster. Light-CoNav also outperforms state-of-the-art methods at generalizing to objects and environments it was not trained on. The team has released an open-source dataset of 28,000 quality-checked reasoning and question-asking traces to support training and analysis, giving the research community a shared resource for building and comparing future models.
- Introduces QAsk-Nav, the first benchmark to separately score navigation and collaborative question-asking in AI agents.
- Includes an open-source dataset of 28,000 reasoning traces for training and analyzing interactive AI capabilities.
- Demonstrates Light-CoNav, a new model that is 3x smaller and 70x faster than prior methods while improving generalization.
Why It Matters
By scoring navigation and question-asking separately, QAsk-Nav gives researchers a reproducible framework for developing more efficient, communicative AI assistants that can collaborate with humans in complex, real-world environments.