HSGM map lets VLMs navigate 3D spaces without any training

arXiv cs.CV June 02, 2026

⚡Zero-shot navigation beats supervised methods using a multi-level semantic-geometric map.

Deep Dive

Vision-Language Navigation (VLN) has long been limited by a critical gap: vision-language models (VLMs) excel at 2D visual understanding and language, but they struggle with 3D spatial reasoning and action dynamics. In a new paper, Kailing Li and six colleagues introduce the Hierarchical Semantic-Geometric Map (HSGM) to close that gap. HSGM transforms raw 3D point clouds into a structured, multi-channel top-down map with three distinct levels. The geometric level marks navigable regions and obstacles; the semantic level encodes objects and their relationships; and the decision level supports high-level task reasoning and goal selection. The map bridges 2D and 3D by giving VLMs a spatial representation they can actually interpret.
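To make the idea of a multi-channel top-down map concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the `TopDownMap` class, the `rasterize` function, the cell codes, and the height thresholds are all hypothetical, and the decision level is left out because the summary above does not describe how it is built.

```python
from dataclasses import dataclass, field

# Hypothetical cell codes for the geometric channel (not taken from the paper).
UNKNOWN, FREE, OBSTACLE = 0, 1, 2


@dataclass
class TopDownMap:
    """A toy two-channel top-down grid: geometry plus object labels."""
    width: int
    height: int
    cell_size: float  # metres per grid cell (assumed unit)
    geometric: list = field(init=False)
    semantic: list = field(init=False)

    def __post_init__(self):
        self.geometric = [[UNKNOWN] * self.width for _ in range(self.height)]
        self.semantic = [[None] * self.width for _ in range(self.height)]


def rasterize(points, grid, floor_z=0.1, max_z=1.8):
    """Project labelled 3D points (x, y, z, label) onto the grid.

    Assumed rule: points at or below floor_z mark a cell FREE, points
    between floor_z and max_z mark it OBSTACLE and record their label,
    and points above max_z (e.g. ceiling) are ignored.
    """
    for x, y, z, label in points:
        col = int(x // grid.cell_size)
        row = int(y // grid.cell_size)
        if not (0 <= row < grid.height and 0 <= col < grid.width):
            continue  # point falls outside the mapped area
        if z > max_z:
            continue  # too high to block the agent
        if z <= floor_z:
            # Floor evidence never overrides an obstacle already seen here.
            if grid.geometric[row][col] != OBSTACLE:
                grid.geometric[row][col] = FREE
        else:
            grid.geometric[row][col] = OBSTACLE
            if label is not None:
                grid.semantic[row][col] = label
    return grid


# A tiny hand-made point cloud: floor, a chair, and a ceiling point.
points = [
    (0.2, 0.2, 0.0, None),
    (1.2, 0.2, 0.5, "chair"),
    (1.2, 0.2, 0.0, None),
    (0.2, 1.2, 2.5, None),
]
demo_map = rasterize(points, TopDownMap(width=3, height=3, cell_size=0.5))
```

Keeping geometry and semantics in separate channels over the same grid means a planner can ask two different questions of one cell: whether it can be stood on, and what object occupies it.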

During navigation, the VLM acts as a semantic planner, reading the HSGM to pick geometrically valid waypoints, while a classical path-planning algorithm handles collision-free movement between those points. This decouples semantic reasoning from low-level execution, making the system more robust. Complex instructions are automatically decomposed into subtasks to prevent progress forgetting or hallucination over long horizons. Tested on the standard R2R-CE and RxR-CE benchmarks, HSGM achieves state-of-the-art zero-shot performance and even beats several supervised approaches. The code is available on GitHub, opening the door to more reliable, training-free robot navigation guided by natural language.
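The split between semantic planning and low-level execution can be sketched in a few lines of Python. This is a hedged illustration, not the authors' code: the summary does not say which path planner the paper uses, so breadth-first search on a 4-connected grid stands in for it, and the VLM is replaced by a hypothetical `choose_waypoint` callable. The names `plan_path` and `navigate` are likewise invented for this example.

```python
from collections import deque


def plan_path(free, start, goal):
    """Breadth-first search on a 4-connected grid of booleans.

    Returns the list of (row, col) cells from start to goal, or None
    if the goal cannot be reached through free cells.
    """
    rows, cols = len(free), len(free[0])
    parents = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:  # walk parent links back to start
                path.append(cell)
                cell = parents[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and free[nr][nc] and (nr, nc) not in parents:
                parents[(nr, nc)] = cell
                queue.append((nr, nc))
    return None


def navigate(free, start, subtasks, choose_waypoint):
    """Execute subtasks in order, one waypoint per subtask.

    choose_waypoint(subtask, position) is a stand-in for the VLM: it
    returns a (row, col) goal. Waypoints that are off the map, blocked,
    or unreachable are rejected rather than executed.
    """
    position, trajectory = start, [start]
    for subtask in subtasks:
        r, c = waypoint = choose_waypoint(subtask, position)
        valid = 0 <= r < len(free) and 0 <= c < len(free[0]) and free[r][c]
        if not valid:
            raise ValueError(f"invalid waypoint {waypoint} for {subtask!r}")
        path = plan_path(free, position, waypoint)
        if path is None:
            raise ValueError(f"no path to {waypoint} for {subtask!r}")
        trajectory.extend(path[1:])  # skip the cell we are already on
        position = waypoint
    return trajectory


# Toy map: True means navigable. A wall blocks most of the middle row.
free_cells = [
    [True, True, True],
    [False, False, True],
    [True, True, True],
]
# A lookup table plays the role of the VLM in this demo.
scripted = {"go to the hallway": (0, 2), "stop at the door": (2, 0)}
route = navigate(
    free_cells,
    (0, 0),
    ["go to the hallway", "stop at the door"],
    lambda subtask, position: scripted[subtask],
)
```

Because the planner only ever proposes goals and the search only ever moves through free cells, a poor waypoint choice surfaces as a rejected goal instead of a collision, which is one way to read the robustness claim above.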

Key Points

HSGM uses three map levels: geometric (navigable areas), semantic (objects/relationships), and decision (goal selection).
Zero-shot framework achieves SOTA on R2R-CE and RxR-CE, outperforming several supervised methods.
Instruction decomposition prevents hallucination and progress forgetting during long-horizon navigation.

Why It Matters

Enables robots to follow complex instructions in unfamiliar spaces without expensive training, unlocking practical autonomous navigation.
