IntentNav lets robots navigate like humans using VLM imitation
IntentNav learns human-like search intent from demonstrations and beats the state of the art on three ObjectNav benchmarks.
IntentNav, developed by a team of 12 researchers from institutions including Nanyang Technological University and Carnegie Mellon, addresses the challenge of object navigation in unknown environments. The framework learns from human demonstrations by extracting high-level search intent using a novel Frontier-based Human-Intent Labeling technique. This method looks ahead in human trajectories to identify which unexplored frontier best explains future actions. The system then constructs a spatial-visual candidate space, combining BEV memory (tracking explored regions and frontiers) with egocentric visual memory (providing semantic cues). A vision-language model (VLM) policy is trained with an Intent-Aligned Objective to select among these grounded candidates, producing consistent, human-like exploration behavior.
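The look-ahead labeling idea can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `label_intent`, the `horizon` parameter, and the scoring rule (the frontier that the human's upcoming path comes closest to) are assumptions chosen for illustration, since this summary does not specify how "best explains" is computed.

```python
import math

def label_intent(trajectory, frontiers, t, horizon=10):
    """Return the index of the frontier that best explains the human's
    next `horizon` positions after step t, or None if there is nothing
    to label. The scoring rule is a hypothetical stand-in."""
    future = trajectory[t + 1 : t + 1 + horizon]
    if not future or not frontiers:
        return None

    def closest_approach(frontier):
        # Smallest distance between the future path and this frontier.
        return min(math.dist(p, frontier) for p in future)

    # The frontier the future path approaches most closely is the label.
    return min(range(len(frontiers)), key=lambda i: closest_approach(frontiers[i]))

# Toy example: the demonstrator walks along the x-axis.
trajectory = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)]
frontiers = [(0, 5), (5, 0), (-3, 0)]
label = label_intent(trajectory, frontiers, t=0)
```

Here `label` is the index of the frontier at `(5, 0)`, the one the walk is heading toward. Each labeled step turns a sequence of low-level actions into a single high-level choice that a policy can be trained to imitate.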
IntentNav achieves state-of-the-art performance on the MP3D, HM3D-v1, and HM3D-v2 ObjectNav benchmarks, surpassing prior methods by significant margins. A key advantage is its zero-shot transfer capability: the candidate-level navigation interface works across wheeled, quadruped, and humanoid robots without any VLM fine-tuning. This suggests the learned spatial-visual representation generalizes across different morphologies. The 26-page paper (arXiv:2606.08029) shows how human demonstration data can be used to build more intuitive and efficient robot navigation policies.
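One way to picture why a candidate-level interface could transfer across robot bodies is the sketch below. It rests on assumptions, not on the paper's code: `Candidate`, `navigate_step`, `choose`, and `go_to` are hypothetical names, and the split between a shared selection policy and a robot-specific motion controller is an illustrative reading of the summary.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

@dataclass
class Candidate:
    position: Tuple[float, float]  # goal location in the BEV map frame
    kind: str                      # e.g. "frontier" or "visual" (assumed labels)

def navigate_step(
    candidates: Sequence[Candidate],
    choose: Callable[[Sequence[Candidate]], int],
    go_to: Callable[[Tuple[float, float]], None],
) -> int:
    """Run one decision step: the shared policy picks a candidate index,
    and the robot-specific controller handles the actual motion."""
    index = choose(candidates)
    go_to(candidates[index].position)
    return index

# Stand-ins: a trivial policy and a controller that only records goals.
visited = []
candidates = [Candidate((1.0, 2.0), "frontier"), Candidate((4.0, 0.5), "visual")]
picked = navigate_step(candidates, choose=lambda c: len(c) - 1, go_to=visited.append)
```

Under this reading, moving to a different robot means replacing only `go_to`, while the policy behind `choose` stays unchanged.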
- IntentNav uses Frontier-based Human-Intent Labeling to extract high-level search intent from low-level human actions.
- Achieves SOTA on MP3D, HM3D-v1, and HM3D-v2 ObjectNav benchmarks, outperforming prior navigation methods.
- Zero-shot transfers to wheeled, quadruped, and humanoid robots without any VLM fine-tuning.
Why It Matters
IntentNav brings robots a step closer to human-like exploration: a single policy navigates unknown spaces efficiently and transfers to new robot bodies without VLM fine-tuning.