Rewarding DINO: Predicting Dense Rewards with Vision Foundation Models
A new method learns actual task semantics across 24 Meta-World+ manipulation tasks, enabling generalization to real-world settings.
A team from the University of Freiburg and Bosch has published a new paper, 'Rewarding DINO: Predicting Dense Rewards with Vision Foundation Models,' introducing a novel approach to training robots. The core innovation is a language-conditioned reward model that learns the actual semantics of a task, such as 'pick up the block,' rather than just mimicking the visual steps of a single expert demonstration. This addresses a major limitation in robotics: traditional methods produce reward functions that are biased toward one specific solution path and that fail in states not covered by the training data.
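As a rough illustration of what such a language-conditioned reward model looks like, here is a minimal sketch in PyTorch. The architecture, layer sizes, and names (`LanguageConditionedRewardModel`, the feature dimensions) are illustrative assumptions rather than the paper's exact design; the key idea is a small head that maps frozen vision-foundation-model features plus a language embedding to a scalar reward.

```python
import torch
import torch.nn as nn

class LanguageConditionedRewardModel(nn.Module):
    """Maps (image features, task-description embedding) -> scalar dense reward.

    Assumption: image features come from a frozen DINO-style backbone and the
    instruction from a text encoder; both are stand-ins here, not the paper's
    exact components.
    """
    def __init__(self, img_dim=384, txt_dim=512, hidden=256):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(img_dim + txt_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # single scalar reward per frame
        )

    def forward(self, img_feat, txt_feat):
        return self.head(torch.cat([img_feat, txt_feat], dim=-1)).squeeze(-1)

# Usage with dummy features standing in for DINO / text-encoder outputs:
model = LanguageConditionedRewardModel()
img = torch.randn(8, 384)   # e.g., features of 8 camera frames
txt = torch.randn(8, 512)   # e.g., embedding of "pick up the block"
rewards = model(img, txt)   # shape: (8,)
```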
Rewarding DINO was trained on data from 24 manipulation tasks in the Meta-World+ benchmark using a rank-based loss. The resulting model is compact enough to serve as a direct, computationally efficient replacement for hand-coded analytical reward functions. In evaluations, it showed strong pairwise accuracy, rank correlation, and calibration. Crucially, it generalized to new settings in both simulation and real-world experiments, evidence that it learned the underlying task concepts rather than memorizing the training demonstrations.
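The summary names a rank-based loss without spelling out its form; a common choice consistent with that description is a pairwise, Bradley-Terry-style objective that pushes frames later in a successful demonstration to score higher than earlier ones. The sketch below shows that generic formulation, not necessarily the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def pairwise_rank_loss(r_early: torch.Tensor, r_late: torch.Tensor) -> torch.Tensor:
    """Predicted rewards for earlier/later frames of the same trajectory.

    Minimizing -log sigmoid(r_late - r_early) enforces the temporal ranking:
    progress toward task completion should earn higher reward.
    """
    return -F.logsigmoid(r_late - r_early).mean()

# Example: sample frame pairs (t1 < t2) from one trajectory of predicted rewards.
T = 50
traj_rewards = torch.randn(T, requires_grad=True)        # stand-in for model outputs
t1 = torch.randint(0, T - 1, (32,))
t2 = (t1 + torch.randint(1, T, (32,))).clamp(max=T - 1)  # guarantees t2 > t1
loss = pairwise_rank_loss(traj_rewards[t1], traj_rewards[t2])
loss.backward()
```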
The researchers validated the model's utility by integrating it with standard, off-the-shelf reinforcement learning (RL) algorithms, which successfully solved tasks from the training set using Rewarding DINO's predicted rewards as guidance. This demonstrates a practical pipeline: describe a task in language, let Rewarding DINO interpret it visually to generate reward signals, and use a generic RL algorithm to learn a policy. This moves robotics closer to flexible, language-instructed systems that can understand and accomplish goals without exhaustive pre-programming for every scenario.
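In code, that pipeline amounts to wrapping an environment so the learned model supplies the reward while everything else stays untouched. This sketch uses the Gymnasium wrapper API; `reward_model`, `encode_image`, and the precomputed instruction embedding are hypothetical stand-ins for the paper's components.

```python
import gymnasium as gym
import torch

class LearnedRewardWrapper(gym.Wrapper):
    """Replaces the env's reward with a learned, language-conditioned prediction."""

    def __init__(self, env, reward_model, encode_image, txt_feat):
        super().__init__(env)
        self.reward_model = reward_model    # e.g., the sketch model above
        self.encode_image = encode_image    # frame (H, W, 3) -> feature vector
        self.txt_feat = txt_feat            # precomputed instruction embedding

    def step(self, action):
        # Keep the env's dynamics and termination; swap only the reward signal.
        obs, _, terminated, truncated, info = self.env.step(action)
        frame = self.env.render()           # requires render_mode="rgb_array"
        with torch.no_grad():
            img_feat = self.encode_image(frame)
            reward = self.reward_model(img_feat, self.txt_feat).item()
        return obs, reward, terminated, truncated, info
```

Because only the reward changes, any standard implementation of SAC, PPO, or a similar algorithm can train on the wrapped environment unchanged.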
- Learns task semantics from language, not just visual imitation of demonstrations.
- Trained on 24 Meta-World+ tasks and generalizes to new real-world settings.
- Enables off-the-shelf RL algorithms to solve tasks by providing dense reward signals.
Why It Matters
Moves robotics toward flexible, language-instructed systems that can understand goals without exhaustive pre-programming for every scenario.