GROW: New RL framework for VLM agents beats SOTA on 800+ Minecraft tasks
Decomposing trajectories into state-action samples unlocks multi-turn reinforcement learning for open-world agents.
A team of researchers (Xiongbin Wu, Zhihao Luo, et al.) has introduced GROW, a reinforcement learning framework that aligns Group Relative Policy Optimization (GRPO) with state-action modeling for vision-language model (VLM) agents operating in open-world environments like Minecraft. Existing approaches rely heavily on supervised fine-tuning on expert demonstrations and struggle with multi-turn tasks, where agents must repeatedly perceive and act. Standard GRPO requires full trajectory samples, which in multi-turn settings leads to excessive context length and noise. GROW instead decomposes trajectories into individual state-action pairs and computes policy advantages across these samples, preserving GRPO's core optimization signal while enabling effective multi-turn RL.
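The decomposition described above can be illustrated with a small sketch. This is not the authors' implementation: the `Sample` class, the `decompose` and `group_relative_advantages` helpers, and the toy states and actions are all hypothetical. The sketch also assumes that every state-action sample inherits the return of the trajectory it came from, which the summary does not confirm. Only the normalization step (reward minus group mean, divided by group standard deviation) is the standard GRPO advantage.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List, Tuple

Step = Tuple[str, str]                 # (state, action); strings stand in for images/prompts
Trajectory = Tuple[List[Step], float]  # (steps, trajectory-level return)


@dataclass
class Sample:
    state: str
    action: str
    reward: float          # ASSUMPTION: each step inherits its trajectory's return
    advantage: float = 0.0


def decompose(trajectories: List[Trajectory]) -> List[Sample]:
    """Flatten whole trajectories into independent state-action samples."""
    samples = []
    for steps, trajectory_return in trajectories:
        for state, action in steps:
            samples.append(Sample(state, action, trajectory_return))
    return samples


def group_relative_advantages(samples: List[Sample], eps: float = 1e-8) -> List[Sample]:
    """GRPO-style normalization over the group: (r - mean) / (std + eps)."""
    rewards = [s.reward for s in samples]
    mu, sigma = mean(rewards), pstdev(rewards)
    for s in samples:
        s.advantage = (s.reward - mu) / (sigma + eps)
    return samples


# Two toy rollouts of the same task: one succeeds (return 1.0), one fails (return 0.0).
trajectories = [
    ([("s0", "move_forward"), ("s1", "attack")], 1.0),
    ([("s0", "turn_left"), ("s1", "jump")], 0.0),
]
samples = group_relative_advantages(decompose(trajectories))
for s in samples:
    print(s.state, s.action, round(s.advantage, 3))
```

Each sample carries only its own state as context, so the policy update never needs the full trajectory in a single prompt. Steps from the successful rollout receive advantages near +1 and steps from the failed rollout near -1. One consequence of this sketch is that longer trajectories contribute more samples to the group; whether and how GROW corrects for that is not stated in the summary.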
In experiments spanning over 800 Minecraft tasks, GROW achieved state-of-the-art performance, supporting its sample decomposition strategy. The paper's surrogate analysis shows that, under simplifying assumptions, the objective maintains the relative optimization strength of GRPO even when grouped samples are conditioned on different local states rather than on a single prompt context. This work marks a step toward training more capable open-world VLM agents, with potential applications in gaming, robotics, and other domains requiring sustained autonomous reasoning and action.
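To make the contrast between a shared prompt and per-sample local states concrete, the equations below first give the standard GRPO group-relative advantage and then one plausible state-conditioned surrogate. The second form is an assumption based on this summary, not the paper's stated objective: the symbols s_i and a_i (the state and action of sample i) are illustrative, and the KL regularization term usually included in GRPO is omitted for brevity.

```latex
% Standard GRPO: G sampled outputs for one shared prompt, with rewards r_1, ..., r_G
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)}

% Assumed state-action form: sample i is an action a_i taken in its own local state s_i
\rho_i(\theta) = \frac{\pi_\theta(a_i \mid s_i)}{\pi_{\theta_{\mathrm{old}}}(a_i \mid s_i)},
\qquad
J(\theta) = \frac{1}{G} \sum_{i=1}^{G}
  \min\!\Big( \rho_i(\theta)\, \hat{A}_i,\;
  \operatorname{clip}\big(\rho_i(\theta),\, 1 - \epsilon,\, 1 + \epsilon\big)\, \hat{A}_i \Big)
```

In this reading, the advantage is still computed relative to the group, while each probability ratio is evaluated on a single state rather than on a full trajectory.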
- GROW decomposes collected trajectories into state-action samples to enable multi-turn RL for VLM agents.
- Achieves state-of-the-art performance across 800+ Minecraft tasks, surpassing existing supervised fine-tuning methods.
- Surrogate analysis shows the framework preserves GRPO's relative optimization signal despite local state conditioning.
Why It Matters
Opens the door to more reliable, multi-step AI agents for complex open-world tasks like game automation and robotics.