MARVL: Multi-Stage Guidance for Robotic Manipulation via Vision-Language Models
A new method fine-tunes Vision-Language Models to produce more reliable reward signals, addressing their spatial-grounding weaknesses.
Researchers from multiple institutions developed MARVL, a multi-stage guidance system for robotic manipulation built on Vision-Language Models (VLMs). MARVL fine-tunes VLMs for spatial and semantic consistency and decomposes tasks into subtasks with trajectory-sensitive guidance. On the Meta-World benchmark, it significantly outperforms existing VLM-reward methods, showing superior sample efficiency and robustness on sparse-reward manipulation tasks that previously required manual reward engineering.
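The general pattern described above, dense guidance from a VLM scorer layered on top of a sparse environment reward and averaged over subtasks, can be sketched as follows. The `score_alignment` function is a hypothetical stand-in for a fine-tuned VLM call, and the names, prompts, and weighting are illustrative assumptions, not MARVL's actual implementation.

```python
def score_alignment(observation: dict, subtask_prompt: str) -> float:
    """Hypothetical stand-in for a fine-tuned VLM scorer.

    A real system would render the observation as an image and query the
    VLM for alignment with the subtask description; here a toy lookup
    keeps the sketch runnable end to end.
    """
    return min(1.0, observation.get(subtask_prompt, 0.0))


def shaped_reward(observation: dict, subtasks: list[str],
                  env_reward: float, weight: float = 0.5) -> float:
    """Combine a sparse environment reward with per-subtask VLM guidance.

    The VLM scores for each subtask prompt are averaged and added to the
    environment reward with a fixed weight (an assumed shaping scheme).
    """
    if not subtasks:
        return env_reward
    vlm_score = sum(score_alignment(observation, s) for s in subtasks)
    return env_reward + weight * vlm_score / len(subtasks)
```

With no environment reward but a fully satisfied subtask, the agent still receives dense guidance: `shaped_reward({"reach handle": 1.0}, ["reach handle"], 0.0)` returns `0.5`, which is what lets learning proceed before the sparse task reward is ever triggered.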
Why It Matters
Automates reward design for robots, enabling faster learning of complex manipulation tasks without human intervention.