Demo2Reward optimizes VLM rewards with just 3-10 expert demos

arXiv cs.LG June 02, 2026

No more manual reward engineering: optimize prompts from a handful of demonstrations

Deep Dive

Reinforcement learning relies on accurate reward functions, but hand-crafting them is often impractical, especially in robotics. Recent work uses pretrained Vision-Language Models (VLMs) as zero-shot reward models, but these produce suboptimal rewards with false positives that degrade downstream policies. In a new paper, Christian Gumbsch and colleagues introduce Demo2Reward, a test-time prompt optimization technique that leverages a small set of expert demonstrations (3–10 trajectories) to refine the VLM's language instruction. Crucially, this adaptation happens before policy training and requires no additional model training or computational overhead during RL. The optimized prompt reduces false positive rewards while preserving true positives, leading to more reliable policy learning.
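
To make the idea of tuning a prompt against demonstrations concrete, here is a minimal Python sketch. It is not the paper's algorithm, which this summary does not describe in detail. It assumes a fixed list of candidate prompts, treats the final frame of each expert trajectory as a success and all earlier frames as non-successes, and replaces the VLM with a toy word-overlap scorer. The names `toy_vlm_reward`, `score_prompt`, and `select_prompt`, the text captions standing in for images, and the 0.75 threshold are all hypothetical.

```python
from typing import Callable, Sequence

# Assumption: a frame is a plain-text caption; a real setup would pass images to a VLM.
Frame = str
Trajectory = Sequence[Frame]
RewardFn = Callable[[str, Frame], float]  # (prompt, frame) -> score in [0, 1]


def toy_vlm_reward(prompt: str, frame: Frame) -> float:
    """Hypothetical stand-in for a VLM: fraction of prompt words found in the caption."""
    words = prompt.lower().split()
    caption = set(frame.lower().split())
    return sum(w in caption for w in words) / len(words)


def score_prompt(prompt: str, demos: Sequence[Trajectory],
                 reward_fn: RewardFn, threshold: float = 0.75) -> float:
    """True-positive rate on final frames minus false-positive rate on earlier frames.

    Assumption: only the last frame of each expert demo shows task success.
    """
    tp = fp = n_pos = n_neg = 0
    for traj in demos:
        *earlier, final = traj
        n_pos += 1
        tp += reward_fn(prompt, final) >= threshold
        for frame in earlier:
            n_neg += 1
            fp += reward_fn(prompt, frame) >= threshold
    tpr = tp / n_pos
    fpr = fp / n_neg if n_neg else 0.0
    return tpr - fpr


def select_prompt(candidates: Sequence[str], demos: Sequence[Trajectory],
                  reward_fn: RewardFn, threshold: float = 0.75) -> str:
    """Pick the candidate prompt with the best score on the demonstrations."""
    return max(candidates, key=lambda p: score_prompt(p, demos, reward_fn, threshold))


# Three toy "expert trajectories" for an open-the-drawer task.
demos = [
    ["gripper above drawer", "gripper touches drawer handle", "drawer open"],
    ["gripper near drawer", "gripper pulls drawer handle", "drawer open"],
    ["gripper idle", "gripper grasps drawer handle", "drawer fully open"],
]
candidates = ["drawer", "drawer open", "drawer handle"]

best = select_prompt(candidates, demos, toy_vlm_reward)
print(best)  # drawer open
```

In this toy setting the vague prompt "drawer" fires on almost every frame (many false positives), while "drawer open" fires only on the final frames, so it is selected. The selection runs once, before any policy training, which mirrors the claim that no extra compute is needed during RL.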

Demo2Reward consistently outperforms existing zero- and few-shot VLM reward models across multiple simulated robotic tasks and policy backbones. The team also validated the approach on a real-world robotic manipulation scenario, showing that policies can be learned without manually engineering a reward function. By eliminating the need for expensive reward design and enabling effective use of limited demonstration data, Demo2Reward offers a practical path to deploying RL in real-world settings where reward specification is a bottleneck.

Key Points

Demo2Reward optimizes VLM reward model prompts using as few as 3 expert trajectories
Requires no extra model training or compute during policy learning — only test-time adaptation
Outperforms zero-shot and few-shot VLM reward models across multiple simulated robotic tasks, and was validated on a real-world manipulation scenario

Why It Matters

Demo2Reward makes RL more practical for robotics by replacing manual reward engineering with prompt optimization that needs only a few demonstrations.
