Robotics

GoalVLM: VLM-driven Object Goal Navigation for Multi-Agent System

New framework lets robot teams navigate to novel objects like 'blue mug' without any task-specific training.

Deep Dive

A team of researchers has introduced GoalVLM, a novel framework that allows teams of robots to navigate to objects described in plain language without any prior training on those specific items. This solves a major limitation in robotics, where traditional systems are confined to a fixed, pre-programmed vocabulary of objects. GoalVLM achieves this by integrating a Vision-Language Model (VLM) directly into the robot's decision-making loop. The system uses SAM3 for text-prompted object detection and segmentation, and SpaceOM for spatial reasoning, enabling it to understand commands like 'find the blue mug on the wooden table.'
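The described decision loop — try to detect the goal with a text-prompted segmenter, localize it if visible, otherwise ask the VLM which frontier to explore next — can be sketched roughly as below. All class and function names here (Observation, detect, rank_frontiers_with_vlm, agent_step) are hypothetical placeholders for illustration, not the authors' actual API.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Observation:
    rgb: np.ndarray      # H x W x 3 color image
    depth: np.ndarray    # H x W depth map in meters


def detect(goal_text: str, obs: Observation):
    """Placeholder for a text-prompted detector/segmenter (a SAM3-style model).

    Returns a binary mask for the goal object, or None if it is not visible.
    """
    return None  # stub: plug in the real model here


def rank_frontiers_with_vlm(goal_text: str, obs: Observation, frontiers: list) -> int:
    """Placeholder for the VLM prompt chain (scene caption -> room type -> frontier ranking)."""
    return 0  # stub: return the index of the most promising frontier


def agent_step(goal_text: str, obs: Observation, frontiers: list) -> dict:
    """One decision step per agent: head for the goal if detected, otherwise explore."""
    mask = detect(goal_text, obs)
    if mask is not None:
        # Goal is visible: hand the mask to the goal projector for localization.
        return {"action": "navigate_to_goal", "mask": mask}
    # Goal not visible: let the VLM pick the next frontier to explore.
    best = rank_frontiers_with_vlm(goal_text, obs, frontiers)
    return {"action": "explore", "frontier": frontiers[best]}
```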

Each robot agent builds a Bird's-Eye View (BEV) semantic map of its environment using depth-projected voxel splatting. A key component called the Goal Projector then back-projects object detections through calibrated depth data to reliably localize targets on the map. For exploration, a constraint-guided reasoning layer evaluates potential paths through a structured prompt chain that includes scene captioning, room-type classification, and multi-frontier ranking, injecting commonsense priors into the search process.
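As a rough illustration of the Goal Projector idea, the sketch below back-projects the pixels of a detection mask through the depth image using pinhole intrinsics, transforms them into the world frame, and snaps the centroid to a 2D BEV grid cell. The intrinsics, extrinsics, and grid parameters are assumptions for the example, not values from the paper.

```python
import numpy as np


def project_detection_to_bev(mask: np.ndarray, depth: np.ndarray,
                             K: np.ndarray, T_world_cam: np.ndarray,
                             cell_size: float = 0.05, grid_origin=(-10.0, -10.0)):
    """Back-project a detection mask through calibrated depth onto a BEV grid.

    mask:        H x W boolean detection mask
    depth:       H x W depth in meters
    K:           3x3 pinhole intrinsics
    T_world_cam: 4x4 camera-to-world transform
    Returns (row, col) BEV grid indices of the detection centroid.
    """
    vs, us = np.nonzero(mask)                    # pixel coordinates inside the mask
    ds = depth[vs, us]
    valid = ds > 0                               # drop pixels with missing depth
    vs, us, ds = vs[valid], us[valid], ds[valid]

    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    # Pinhole back-projection into the camera frame.
    x = (us - cx) * ds / fx
    y = (vs - cy) * ds / fy
    pts_cam = np.stack([x, y, ds, np.ones_like(ds)], axis=0)   # 4 x N homogeneous points

    # Transform to the world frame and take the centroid of the object's points.
    pts_world = T_world_cam @ pts_cam
    centroid = pts_world[:3].mean(axis=1)

    # Drop the height axis and snap the (x, y) position to a BEV grid cell.
    col = int((centroid[0] - grid_origin[0]) / cell_size)
    row = int((centroid[1] - grid_origin[1]) / cell_size)
    return row, col
```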

The system was rigorously evaluated on the challenging GOAT-Bench, which involves navigating to a sequential chain of 5-7 open-vocabulary targets in unseen environments. With just two agents, GoalVLM achieved a competitive 55.8% subtask success rate and 18.3% Success weighted by Path Length (SPL), matching state-of-the-art methods that require extensive task-specific training. Ablation studies confirmed the critical contributions of both the VLM-guided frontier reasoning for efficient exploration and the depth-projected goal localization for accurate target finding. This represents a significant step toward more flexible and general-purpose robotic assistants.
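SPL is the standard Success weighted by Path Length metric (Anderson et al., 2018): each episode's binary success is discounted by the ratio of the shortest possible path to the path the agent actually took. A minimal implementation:

```python
import numpy as np


def spl(successes, shortest_lengths, actual_lengths) -> float:
    """Success weighted by Path Length: mean over episodes of S_i * l_i / max(p_i, l_i)."""
    s = np.asarray(successes, dtype=float)         # 1.0 if the episode succeeded, else 0.0
    l = np.asarray(shortest_lengths, dtype=float)  # shortest possible path length per episode
    p = np.asarray(actual_lengths, dtype=float)    # path length the agent actually traveled
    return float(np.mean(s * l / np.maximum(p, l)))


# Example: a successful episode with a detour scores less than a direct one.
print(spl([1, 1, 0], [5.0, 5.0, 5.0], [5.0, 10.0, 7.0]))  # -> 0.5
```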

Key Points
  • Achieves 55.8% subtask success rate on GOAT-Bench with 2 agents, navigating chains of 5-7 novel objects.
  • Operates zero-shot: interprets free-form language commands and requires no retraining for new object categories.
  • Integrates VLM for decision-making, SAM3 for detection, and builds BEV semantic maps via depth-projected voxel splatting.

Why It Matters

Enables future warehouse, home, or search-and-rescue robots to understand complex, novel instructions without costly retraining for every new object.