A Subgoal-driven Framework for Improving Long-Horizon LLM Agents
Open-source Gemma3-12B beats GPT-4o on web tasks after milestone-based RL training.
A research team from Meta and University College London has published a paper titled "A Subgoal-driven Framework for Improving Long-Horizon LLM Agents." The work addresses a critical weakness in current AI agents: their tendency to fail at complex, multi-step tasks such as web navigation. The researchers propose a two-part framework. First, they enhance online planning by having agents such as Gemini decompose tasks into subgoals, which alone yields a roughly 10% absolute gain in success rate. Their second and more significant contribution is MiRA (Milestoning your Reinforcement Learning Enhanced Agent), a novel reinforcement learning (RL) training framework.
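To make the subgoal-driven planning idea concrete, here is a minimal Python sketch of the control loop it implies. All names here are hypothetical illustrations; the paper's actual prompts, agent interface, and subgoal format are not reproduced.

```python
# Hypothetical sketch of subgoal-driven planning: the agent first asks a
# planner (in practice, an LLM call) to decompose the task, then pursues
# one short-horizon subgoal at a time instead of the whole task at once.

def plan_subgoals(task: str) -> list[str]:
    """Stand-in for an LLM planning call; returns a fixed decomposition."""
    return [
        "Open the site's search page",
        f"Search for the item described in: {task}",
        "Extract the requested field from the top result",
    ]

def run_agent(task: str) -> list[str]:
    """Execute subgoals in order, recording a trace of completed steps."""
    trace = []
    for subgoal in plan_subgoals(task):
        # A real agent would act in the environment until this subgoal is
        # judged complete; here we simply record it as done.
        trace.append(f"completed: {subgoal}")
    return trace
```

The design point is that each inner loop optimizes a short-horizon objective, which is exactly the regime current LLM agents handle well.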
MiRA tackles the 'sparse reward' problem in RL, where an agent receives feedback only upon final success or failure, making it hard to learn which intermediate actions were correct. Instead, MiRA provides dense, milestone-based reward signals throughout a task. The results are dramatic. When applied to the open-source Gemma3-12B model, its success rate on the WebArena-Lite benchmark jumped from 6.4% to 43.0%. This performance not only represents a nearly 7x improvement but also surpasses much larger proprietary models, including GPT-4-Turbo (17.6%) and GPT-4o (13.9%), as well as the previous open-model state of the art, WebRL (38.4%).
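The contrast between sparse and milestone-based rewards can be sketched in a few lines of Python. This is an illustrative toy, not MiRA's actual milestone detection or reward schedule, and all function names and trajectories are invented for the example.

```python
# Toy comparison of sparse vs. milestone-based (dense) reward signals.

def sparse_reward(trajectory: list[str], goal: str) -> float:
    # Classic sparse signal: reward only if the final state reaches the goal.
    return 1.0 if trajectory and trajectory[-1] == goal else 0.0

def milestone_reward(trajectory: list[str], milestones: list[str]) -> float:
    # Dense signal: partial credit for each milestone reached, in order.
    idx = 0
    for step in trajectory:
        if idx < len(milestones) and step == milestones[idx]:
            idx += 1
    return idx / len(milestones)

# An agent that stops short of the goal:
traj = ["open_search", "enter_query"]
milestones = ["open_search", "enter_query", "submit", "read_result"]
```

Here `sparse_reward(traj, "read_result")` is 0.0, while `milestone_reward(traj, milestones)` is 0.5: the dense signal still tells the agent that its first two actions were on the right track, which is the intuition behind milestone-based training.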
The implications are substantial for the field of autonomous AI agents. The research demonstrates that sophisticated training methodologies can enable smaller, open-source models to outperform far larger, closed models on specific, challenging tasks. This paves the way for more capable, efficient, and transparent agent systems that can reliably handle long sequences of actions in dynamic digital environments, from automating software workflows to conducting complex online research.
- MiRA framework uses milestone-based RL rewards, boosting Gemma3-12B's success rate from 6.4% to 43.0% on WebArena-Lite.
- The enhanced open-source model outperforms GPT-4-Turbo (17.6%) and GPT-4o (13.9%) on the benchmark.
- The subgoal-driven planning component also improved proprietary models like Gemini by ~10% in success rate.
Why It Matters
Enables smaller, open-source AI models to outperform giants like GPT-4o on complex, multi-step digital tasks, democratizing advanced agent capabilities.