Research & Papers

HSGM map lets VLMs navigate 3D spaces without any training

Zero-shot navigation beats supervised methods using a multi-level semantic-geometric map.

Deep Dive

Vision-Language Navigation (VLN) has long been limited by a critical gap: while vision-language models (VLMs) excel at 2D visual understanding and language, they struggle with 3D spatial reasoning and action dynamics. In a new paper, Kailing Li and six colleagues introduce the Hierarchical Semantic-Geometric Map (HSGM) to close that gap. HSGM transforms raw 3D point clouds into a structured, multi-channel top-down map with three distinct levels. The geometric level marks navigable regions and obstacles; the semantic level encodes objects and their relationships; and the decision level supports high-level task reasoning and goal selection. The map bridges 2D and 3D by giving the VLM a spatial representation in a form it can interpret.
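To make the idea of a top-down map built from a point cloud concrete, here is a minimal sketch. It is not the paper's construction: the height thresholds, the cell size, the `(x, y, z, label)` point format, and every name in it (`TopDownMap`, `build_map`, `floor_max_z`, `agent_height`) are assumptions chosen for illustration, and it covers only a geometric and a semantic channel, not the decision level.

```python
FREE, OBSTACLE, UNKNOWN = 0, 1, -1


class TopDownMap:
    """Hypothetical two-channel grid: geometry plus object labels."""

    def __init__(self, width, height, cell_size):
        self.width = width
        self.height = height
        self.cell_size = cell_size  # metres per cell (assumed unit)
        self.geometric = [[UNKNOWN] * width for _ in range(height)]
        self.semantic = [[None] * width for _ in range(height)]


def build_map(points, width, height, cell_size,
              floor_max_z=0.1, agent_height=1.5):
    """Rasterize (x, y, z, label) points, z pointing up, into a grid.

    Points near the floor mark a cell as free, points between the floor
    and the agent's height mark it as an obstacle, and higher points
    (a ceiling, for example) are ignored. Thresholds are illustrative.
    """
    grid = TopDownMap(width, height, cell_size)
    for x, y, z, label in points:
        col = int(x // cell_size)
        row = int(y // cell_size)
        if not (0 <= row < height and 0 <= col < width):
            continue  # point falls outside the mapped area
        if z <= floor_max_z:
            # A floor point never overrides an obstacle already seen.
            if grid.geometric[row][col] != OBSTACLE:
                grid.geometric[row][col] = FREE
        elif z <= agent_height:
            grid.geometric[row][col] = OBSTACLE
            if label is not None:
                grid.semantic[row][col] = label
    return grid


points = [
    (0.1, 0.1, 0.0, None),     # floor
    (0.6, 0.1, 0.5, "chair"),  # object at body height
    (0.6, 0.1, 0.0, None),     # floor under the chair
    (0.1, 0.6, 2.5, None),     # ceiling, ignored
]
demo = build_map(points, width=2, height=2, cell_size=0.5)
```

In this toy layout the cell holding the chair stays an obstacle even though a floor point lands in it afterwards, which is the conservative choice for a map that a planner will later treat as ground truth.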

During navigation, the VLM acts as a semantic planner, reading the HSGM to pick geometrically valid waypoints, while a classical path-planning algorithm handles collision-free movement between those points. Decoupling semantic reasoning from low-level execution in this way makes the system more robust. Complex instructions are automatically decomposed into subtasks so that the model does not lose track of its progress or hallucinate over long horizons. Tested on the standard R2R-CE and RxR-CE benchmarks, HSGM achieves state-of-the-art zero-shot performance and even beats several supervised approaches. The code is available on GitHub, opening the door to more reliable, training-free robot navigation guided by natural language.
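The split between a semantic planner and a classical path planner can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the article does not say which path-planning algorithm is used, so breadth-first search on a 4-connected grid stands in for it, and `choose_waypoint` is a hypothetical callable standing in for the VLM. Skipping a subtask whose waypoint is invalid is also an assumption made here for simplicity.

```python
from collections import deque

FREE, OBSTACLE = 0, 1


def plan_path(grid, start, goal):
    """Breadth-first shortest path over 4-connected free cells.

    Returns a list of (row, col) cells from start to goal, or None
    when either endpoint is blocked or the goal is unreachable.
    """
    rows, cols = len(grid), len(grid[0])

    def is_free(cell):
        r, c = cell
        return 0 <= r < rows and 0 <= c < cols and grid[r][c] == FREE

    if not (is_free(start) and is_free(goal)):
        return None
    parent = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if is_free(nxt) and nxt not in parent:
                parent[nxt] = cell
                queue.append(nxt)
    return None


def navigate(grid, start, subtasks, choose_waypoint):
    """Handle subtasks in order: pick a waypoint, then plan a path to it."""
    position, trajectory = start, [start]
    for subtask in subtasks:
        # Hypothetical stand-in for the VLM reading the map.
        waypoint = choose_waypoint(subtask, grid, position)
        path = plan_path(grid, position, waypoint)
        if path is None:
            continue  # reject a blocked or unreachable waypoint
        trajectory.extend(path[1:])
        position = waypoint
    return trajectory


grid = [
    [FREE, OBSTACLE, FREE],
    [FREE, OBSTACLE, FREE],
    [FREE, FREE, FREE],
]
# A fixed lookup plays the role of the semantic planner in this demo.
targets = {"go to the door": (2, 0), "enter the wall": (0, 1),
           "stop by the window": (0, 2)}
route = navigate(grid, (0, 0), list(targets),
                 lambda subtask, g, pos: targets[subtask])
```

Because the geometric check lives in the path planner, a waypoint that lands on an obstacle is caught before any motion happens, which is the practical benefit of keeping the two roles separate.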

Key Points
  • HSGM uses three map levels: geometric (navigable areas), semantic (objects/relationships), and decision (goal selection).
  • The zero-shot framework achieves state-of-the-art zero-shot results on R2R-CE and RxR-CE and outperforms several supervised methods.
  • Instruction decomposition into subtasks guards against hallucination and progress forgetting during long-horizon navigation.

Why It Matters

HSGM lets robots follow complex instructions in unfamiliar spaces without expensive training, a step toward practical autonomous navigation.