ProGAL-VLA: Grounded Alignment through Prospective Reasoning in Vision-Language-Action Models
New robotic AI from Nastaran Darabi and Amit Trivedi improves task success from 30% to 71% when robots get bumped.
Researchers Nastaran Darabi and Amit Trivedi have introduced ProGAL-VLA (Prospective Grounding and Alignment VLA), a new architecture designed to solve a critical flaw in current robotic AI. Existing Vision-Language-Action models often suffer from 'language ignorance,' where they rely on visual shortcuts and ignore subtle changes in human instructions. ProGAL-VLA tackles this by constructing a 3D, entity-centric graph of the scene (called a GSM) and using a 'slow planner' to break tasks into symbolic sub-goals. A novel Grounding Alignment Contrastive (GAC) loss then forces the model to tightly align these sub-goals with specific, grounded entities in the world. Every action the robot takes is conditioned on a verified goal embedding, creating a bottleneck that ensures the robot truly understands the command before acting.
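The core idea of the GAC loss is contrastive alignment: each symbolic sub-goal embedding is pulled toward the embedding of its matching grounded entity and pushed away from all other entities in the scene graph. The paper's exact formulation isn't reproduced here; the sketch below is a minimal, hypothetical InfoNCE-style version (function names, embedding shapes, and the temperature value are illustrative assumptions, not the authors' implementation).

```python
import math

def gac_loss(subgoal_embs, entity_embs, temperature=0.1):
    """Illustrative contrastive grounding loss (InfoNCE-style sketch).

    subgoal_embs[i] is assumed to match entity_embs[i]; every other
    entity in the scene serves as a negative. Hypothetical, not the
    paper's actual GAC formulation.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def normalize(v):
        n = math.sqrt(dot(v, v))
        return [x / n for x in v]

    g = [normalize(v) for v in subgoal_embs]
    e = [normalize(v) for v in entity_embs]

    total = 0.0
    for i in range(len(g)):
        # Cosine similarities of sub-goal i to every entity, scaled by temperature.
        sims = [dot(g[i], ej) / temperature for ej in e]
        # Numerically stable log-sum-exp over all candidate entities.
        m = max(sims)
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims))
        # Negative log-probability assigned to the correct (positive) entity.
        total += -(sims[i] - log_denom)
    return total / len(g)
```

With well-aligned pairs the loss is near zero; with shuffled pairings it grows large, which is exactly the pressure that forces sub-goals to bind to the right entities rather than to visual shortcuts.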
The results are dramatic. On the LIBERO-Plus benchmark, ProGAL-VLA increased task success under physical perturbations—like the robot being bumped—from a baseline of 30.3% to 71.5%. It also reduced language ignorance by 3-4 times and improved entity retrieval Recall@1 from 0.41 to 0.71. Perhaps most impressively, on a Custom Ambiguity Benchmark, the model's ability to detect unclear instructions (measured by AUROC) jumped from 0.52 to 0.81. This allowed it to proactively ask for clarification on 81% of ambiguous inputs, up from just 9%, without harming performance on clear tasks. The verification mechanism increases the mutual information between language and actions, supporting the claim that explicit, verified grounding is a powerful path toward more reliable and instruction-sensitive agents.
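The AUROC numbers above measure how well an ambiguity score separates unclear instructions from clear ones: 0.5 is chance, 1.0 is perfect separation. As a reference for how that metric is computed, here is a minimal rank-based (Mann-Whitney) AUROC, applied to hypothetical ambiguity scores (the scores and function name are illustrative, not the paper's model outputs).

```python
def auroc(scores_ambiguous, scores_clear):
    """Rank-based AUROC: the probability that a randomly chosen ambiguous
    instruction receives a higher ambiguity score than a clear one.
    Ties count as half a win. Scores here are hypothetical."""
    pairs = 0
    wins = 0.0
    for a in scores_ambiguous:
        for c in scores_clear:
            pairs += 1
            if a > c:
                wins += 1.0
            elif a == c:
                wins += 0.5
    return wins / pairs
```

A model at 0.52 is scoring ambiguous and clear commands almost interchangeably; at 0.81, a simple threshold on the score lets the robot decide when to ask for clarification instead of acting.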
- More than doubles robustness: Task success under physical perturbations soared from 30.3% to 71.5% on the LIBERO-Plus benchmark.
- Drastically reduces language ignorance: The model is 3-4x more sensitive to changes in human instructions, cutting reliance on visual shortcuts.
- Enables ambiguity awareness: AUROC for detecting unclear commands jumped to 0.81, allowing the robot to ask for clarification on 81% of ambiguous inputs, up from 9%.
Why It Matters
This brings us closer to robots that reliably follow complex, changing instructions in messy real-world environments, a key step for practical deployment.