Generalizable Dense Reward for Long-Horizon Robotic Tasks
New framework uses LLMs and VLMs to automatically generate training rewards, eliminating manual reward engineering.
A research team from Carnegie Mellon University and Georgia Tech has introduced VLLR (Vision-Language-LM Reward), a framework that automatically generates dense rewards for training robotic foundation models on long-horizon tasks. The system addresses a critical bottleneck in robotics: manually engineered reward functions are time-consuming to design and generalize poorly across tasks. VLLR combines two components: an extrinsic reward, derived from Large Language Models (LLMs) and Vision-Language Models (VLMs), that tracks task progress, and an intrinsic reward based on the policy's own self-certainty.
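To make the two-signal structure concrete, here is a minimal sketch of how such a combined per-step reward could be assembled. The article does not specify how self-certainty is computed, so this sketch assumes it is the negative entropy of the policy's action distribution; the function names and the weighting coefficient `beta` are illustrative, not the authors' API.

```python
import torch
import torch.nn.functional as F

def self_certainty_bonus(action_logits: torch.Tensor) -> torch.Tensor:
    """Intrinsic reward: how certain the policy is about its own action
    distribution, approximated here as negative entropy (assumption; the
    paper's exact self-certainty measure may differ)."""
    probs = F.softmax(action_logits, dim=-1)
    log_probs = F.log_softmax(action_logits, dim=-1)
    entropy = -(probs * log_probs).sum(dim=-1)
    return -entropy  # more peaked distribution -> larger bonus

def combined_reward(extrinsic_progress: torch.Tensor,
                    action_logits: torch.Tensor,
                    beta: float = 0.01) -> torch.Tensor:
    """Total per-step reward = VLM-derived subtask-progress signal plus a
    weighted self-certainty bonus. `beta` is a hypothetical scaling factor."""
    return extrinsic_progress + beta * self_certainty_bonus(action_logits)
```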
During training, an LLM first decomposes a complex task (like "clean the kitchen") into verifiable subtasks. A VLM then estimates progress on these subtasks during a brief warm-up phase, and those estimates are used to initialize the value function, avoiding the prohibitive cost of querying these large models throughout full training. The self-certainty component provides per-step intrinsic guidance during Proximal Policy Optimization (PPO) finetuning. Ablation studies showed the two components are complementary: VLM-based value initialization primarily improves task-completion efficiency, while self-certainty boosts success rates, especially on out-of-distribution tasks.
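The warm-up step amounts to regressing the critic onto the VLM's progress scores so that the expensive model is only queried briefly. A self-contained toy sketch of that idea follows, with random tensors standing in for rollout observations and VLM progress estimates; the network size, optimizer settings, and loss are assumptions for illustration only.

```python
import torch
import torch.nn as nn

# Toy critic standing in for the policy's value head.
value_fn = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(value_fn.parameters(), lr=3e-4)

# Stand-ins for observation features and per-step VLM subtask-progress
# estimates (a scalar in [0, 1]); in the real system these would come from
# warm-up rollouts and VLM queries, not random tensors.
obs_features = torch.randn(1024, 128)
vlm_progress = torch.rand(1024, 1)

# Warm-up: fit the value function to the VLM's progress signal, so that
# subsequent PPO finetuning no longer needs to call the VLM at every step.
for epoch in range(20):
    pred = value_fn(obs_features)
    loss = nn.functional.mse_loss(pred, vlm_progress)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

After this warm-up, PPO finetuning proceeds with the warm-started critic and the cheap per-step self-certainty bonus as the intrinsic reward.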
On the CHORES benchmark covering mobile manipulation and navigation, VLLR demonstrated significant improvements. It achieved up to 56% absolute success rate gains over the original pretrained policy, up to 5% gains over state-of-the-art RL finetuning methods on in-distribution tasks, and up to 10% gains on out-of-distribution tasks. This represents a major step toward general-purpose robots that can learn complex, multi-step tasks without human intervention in reward design, potentially accelerating deployment in real-world environments like homes and warehouses.
- Combines LLM task decomposition with VLM progress estimation to create automatic rewards, eliminating manual engineering
- Achieved up to 56% absolute success-rate gains on the CHORES benchmark, plus up to 5% (in-distribution) and 10% (out-of-distribution) gains over state-of-the-art RL finetuning methods
- Uses policy self-certainty as intrinsic reward during PPO finetuning, particularly effective for out-of-distribution tasks
Why It Matters
Enables robots to learn complex, multi-step tasks autonomously, accelerating deployment in homes, warehouses, and other real-world environments.