Robotics

3D Action-Head Injection Boosts Robot VLAs by 46.3 Points

Lightweight two-layer MLP module makes VLAs robust to new positions and tasks.

Deep Dive

Vision-Language-Action (VLA) models struggle with spatial generalization (the same object moved to a new position) and task generalization (a different instruction for the same scene). Existing methods that rely on 2D pixel coordinates or on language or visual prompts do not fix this brittleness. The authors' key insight is that how the grounding signal is represented and injected matters most: a 3D point-based representation fed directly into the action head yields dramatic improvements. They introduce a model-agnostic two-layer MLP module that takes the relative displacement between the gripper and a grounded 3D point and injects the resulting spatial embedding into the action head through adaptive layer normalization. The module adds few parameters and sits on top of a pretrained VLA backbone without requiring changes to the backbone itself.
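To make the mechanism concrete, here is a minimal sketch of displacement-conditioned adaptive layer normalization, written in plain Python so it runs without dependencies. It is not the authors' code. The class name GroundingInjector, the SiLU activation, the (1 + scale) modulation convention, the layer sizes, and the random untrained weights are all assumptions made for illustration; a real implementation would use a tensor library and learned weights.

```python
import math
import random


def linear(x, weight, bias):
    """Apply y = W x + b to one vector; weight is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def silu(x):
    """SiLU activation (an assumed choice; the summary does not name one)."""
    return [v / (1.0 + math.exp(-v)) for v in x]


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance, with no learned affine."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def init_linear(rng, n_in, n_out, scale=0.1):
    """Random weights standing in for trained parameters."""
    weight = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    bias = [rng.uniform(-scale, scale) for _ in range(n_out)]
    return weight, bias


class GroundingInjector:
    """Hypothetical module: 3D displacement -> two-layer MLP -> AdaLN scale and shift."""

    def __init__(self, hidden_dim, feature_dim, seed=0):
        rng = random.Random(seed)
        self.feature_dim = feature_dim
        # Layer 1 maps the 3D displacement to a hidden embedding.
        self.w1, self.b1 = init_linear(rng, 3, hidden_dim)
        # Layer 2 maps the embedding to one scale and one shift per feature.
        self.w2, self.b2 = init_linear(rng, hidden_dim, 2 * feature_dim)

    def __call__(self, action_features, gripper_xyz, target_xyz):
        # Relative displacement from the gripper to the grounded 3D point.
        displacement = [t - g for t, g in zip(target_xyz, gripper_xyz)]
        hidden = silu(linear(displacement, self.w1, self.b1))
        modulation = linear(hidden, self.w2, self.b2)
        scale = modulation[: self.feature_dim]
        shift = modulation[self.feature_dim:]
        # Adaptive layer norm: normalize, then modulate with the predicted values.
        normed = layer_norm(action_features)
        return [n * (1.0 + s) + b for n, s, b in zip(normed, scale, shift)]


injector = GroundingInjector(hidden_dim=8, feature_dim=4)
features = [0.5, -1.0, 2.0, 0.25]  # stand-in for one action-head feature vector
out = injector(features, gripper_xyz=[0.1, 0.0, 0.3], target_xyz=[0.4, -0.2, 0.1])
print(out)
```

Because the sketch conditions only on the difference between the two points, moving the gripper and the target by the same offset leaves the output unchanged. That property is one plausible intuition for why a relative 3D signal could help under position perturbation, though the summary above does not spell out the authors' own explanation.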

On the LIBERO-PRO benchmark, the method improves GR00T-N1.6's average success rate by 46.3 points (from 31.2 to 77.5) under task perturbation and by 32.1 points (from 28.1 to 60.2) under position perturbation. Comparable gains were achieved for π0.5, which supports the claim that the module is backbone-agnostic. These results suggest that, given adequate 3D grounding, direct injection into the action head unlocks both spatial and task generalization. The authors present it as a simple, effective recipe for making VLAs deployment-ready in unseen environments.

Key Points
  • Achieves a 46.3-point gain under task perturbation and a 32.1-point gain under position perturbation on LIBERO-PRO for GR00T-N1.6
  • Module is a two-layer MLP that injects its output through adaptive layer normalization, requiring no VLA backbone changes
  • Works across different models (GR00T-N1.6 and π0.5), supporting the claim of backbone agnosticism

Why It Matters

Lightweight 3D grounding enables robots to handle unseen scenarios without costly retraining.
