Research & Papers

Demo2Reward optimizes VLM rewards with just 3–10 expert demos

No more manual reward engineering: optimize VLM reward prompts from a handful of demonstrations

Deep Dive

Reinforcement learning relies on accurate reward functions, but hand-crafting them is often impractical, especially in robotics. Recent work uses pretrained Vision-Language Models (VLMs) as zero-shot reward models, but these produce suboptimal rewards, including false positives that degrade downstream policies. In a new paper, Christian Gumbsch and colleagues introduce Demo2Reward, a test-time prompt optimization technique that uses a small set of expert demonstrations (3–10 trajectories) to refine the language instruction given to the VLM. Crucially, this adaptation happens before policy training, requires no additional model training, and adds no computational overhead during RL. The optimized prompt reduces false positive rewards while preserving true positives, leading to more reliable policy learning.
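
To make the idea of trading false positives against true positives concrete, here is a minimal Python sketch. It is not the paper's algorithm: it simply picks, from a fixed list of candidate prompts, the one that best separates demonstration frames. It assumes that the final frame of each expert demonstration shows success and that earlier frames do not. The names prompt_score, select_prompt, and toy_vlm are hypothetical, and toy_vlm is a keyword-matching stand-in for a real VLM query, with text descriptions standing in for image frames.

```python
from typing import Callable, Sequence

Trajectory = Sequence[str]            # each frame is a text stand-in for an image
Scorer = Callable[[str, str], bool]   # (prompt, frame) -> "does the VLM report success?"


def prompt_score(prompt: str, demos: Sequence[Trajectory], says_success: Scorer) -> float:
    """True-positive rate on final frames minus false-positive rate on earlier frames.

    Assumption: the last frame of each expert demo is a success state and
    every earlier frame is not.
    """
    finals = [d[-1] for d in demos]
    earlier = [f for d in demos for f in d[:-1]]
    tpr = sum(says_success(prompt, f) for f in finals) / len(finals)
    fpr = sum(says_success(prompt, f) for f in earlier) / len(earlier) if earlier else 0.0
    return tpr - fpr


def select_prompt(candidates: Sequence[str], demos: Sequence[Trajectory],
                  says_success: Scorer) -> str:
    """Return the candidate prompt with the highest score on the demos."""
    return max(candidates, key=lambda p: prompt_score(p, demos, says_success))


def toy_vlm(prompt: str, frame: str) -> bool:
    """Hypothetical stand-in for a VLM: success if every prompt word appears in the frame."""
    return set(prompt.split()) <= set(frame.split())


# Three toy demonstrations of "open the drawer".
demos = [
    ["gripper near drawer", "drawer half open", "drawer fully open"],
    ["drawer closed", "gripper on drawer handle", "drawer fully open"],
    ["arm idle", "drawer half open", "drawer fully open"],
]
candidates = ["drawer", "drawer open", "drawer fully open"]

best = select_prompt(candidates, demos, toy_vlm)
print(best)
```

In this toy setup the vaguer prompts fire on intermediate frames such as "drawer half open", so the most specific prompt is selected. Because the selection runs once over the demonstrations before any policy is trained, it adds nothing to the cost of the RL loop itself, which matches the property described above.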

Demo2Reward consistently outperforms existing zero- and few-shot VLM reward models across multiple simulated robotic tasks and policy backbones. The team also validated the approach on a real-world robotic manipulation scenario, showing that policies can be learned without manually engineering a reward function. By eliminating the need for expensive reward design and enabling effective use of limited demonstration data, Demo2Reward offers a practical path to deploying RL in real-world settings where reward specification is a bottleneck.

Key Points
  • Demo2Reward optimizes VLM reward model prompts using as few as 3 expert trajectories
  • Requires no extra model training or compute during policy learning; the prompt is adapted once, before RL begins
  • Outperforms zero-shot and few-shot VLM reward models across simulated robotic tasks and policy backbones, and was also validated on a real-world manipulation scenario

Why It Matters

Demo2Reward makes RL more practical for robotics by replacing manual reward engineering with prompt optimization from just a few demonstrations.