SpatialPoint: Spatial-aware Point Prediction for Embodied Localization
New vision-language model integrates depth data to tell robots exactly where to act in 3D space.
A research team led by Qiming Zhu has introduced SpatialPoint, a novel AI framework designed to solve the core problem of 'embodied localization'—telling an AI agent exactly where in 3D space to perform an action based on visual input and a language instruction. The system uniquely integrates structured depth data directly into a vision-language model (VLM), allowing it to predict precise 3D coordinates in the camera's frame. This addresses a critical gap, as most current VLMs rely solely on 2D RGB images, forcing them to implicitly reconstruct 3D geometry, which limits their ability to generalize across different environments. SpatialPoint formalizes two types of actionable 3D points: 'touchable points' on surfaces for direct interaction (like grasping) and 'air points' in free space for navigation or placement goals.
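To make the "3D coordinates in the camera's frame" idea concrete, here is a minimal sketch of how a predicted pixel plus its depth value can be lifted into a camera-frame 3D point using the standard pinhole camera model, and how the two point types described above might be represented. This is an illustration only; the function and class names, intrinsics, and pixel values below are assumptions, not SpatialPoint's actual implementation.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class ActionPoint:
    """Hypothetical container for a predicted 3D point in the camera frame."""
    xyz: np.ndarray  # (x, y, z) in meters, camera coordinates
    kind: str        # "touchable" (on a surface) or "air" (free space)

def backproject(u, v, depth_m, fx, fy, cx, cy):
    """Lift pixel (u, v) with metric depth into camera-frame 3D coordinates
    via standard pinhole back-projection. SpatialPoint's internals may differ;
    this only illustrates the pixel+depth -> 3D relationship the summary describes."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.array([x, y, depth_m])

# Example: a grasp target ("touchable" point) at pixel (412, 288) with 0.73 m depth,
# using illustrative intrinsics fx = fy = 600, cx = 320, cy = 240.
grasp = ActionPoint(xyz=backproject(412, 288, 0.73, 600, 600, 320, 240), kind="touchable")
print(grasp.xyz)  # approx. [0.112, 0.058, 0.730]
```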
To train and evaluate SpatialPoint, the team constructed a 2.6-million-sample dataset of RGB-D (color and depth) images paired with question-answer sequences about both touchable and air points. Experiments showed that explicitly incorporating depth information yields a significant performance gain in embodied spatial reasoning over RGB-only baselines. The team demonstrated the model's real-world utility by deploying it on physical robots across three tasks: commanding a robotic arm to grasp an object at a language-specified location, placing an object at a target destination, and navigating a mobile robot to a goal position. This work represents a concrete step toward more capable, spatially aware embodied AI agents that can understand and act on 'where' questions in the physical world.
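As a rough illustration of the kind of sample the dataset pairs together, an RGB-D observation with a "where" question and a 3D point answer might look like the following. The field names and file names here are assumptions for clarity, not the dataset's actual schema.

```python
# Hypothetical training sample: an RGB-D observation paired with a "where" question
# and a 3D point answer expressed in the camera frame.
sample = {
    "rgb": "scene_00421.png",             # color image (illustrative file name)
    "depth": "scene_00421_depth.png",     # aligned depth map
    "question": "Where should the gripper go to pick up the red mug?",
    "answer": {
        "point": [0.112, 0.058, 0.730],   # (x, y, z) in meters, camera frame
        "type": "touchable",              # or "air" for free-space goals
    },
}
```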
- Integrates depth data into a Vision-Language Model (VLM) to predict precise 3D coordinates for action, moving beyond 2D RGB-only systems.
- Trained and evaluated on a novel 2.6M-sample RGB-D dataset covering 'touchable' (surface) and 'air' (free-space) point queries.
- Successfully deployed on real robots for language-guided grasping, object placement, and navigation, demonstrating real-world applicability.
Why It Matters
Enables robots to understand and act on complex 'place this there' instructions, bridging the gap between language commands and precise physical action.