RAGEN-2: Reasoning Collapse in Agentic RL
Researchers identify 'template collapse,' a failure mode in which AI agents fake their reasoning, and propose a fix.
A major collaboration led by Stanford and the Allen Institute for AI has identified a critical, previously invisible flaw in how large language models (LLMs) like GPT-4 or Claude are trained to become autonomous agents. Their paper, RAGEN-2, reveals that during reinforcement learning (RL) training, agents can suffer from 'template collapse.' Here, the model learns to generate reasoning steps that appear varied but are actually generic templates applied regardless of the specific task input. This breakdown in true reasoning directly harms final performance on complex, multi-turn tasks.
The core problem is that standard training metrics, like entropy, only measure diversity of outputs for a single input. They fail to detect when an agent stops distinguishing between different inputs altogether. The researchers address this by decomposing reasoning quality into two parts: within-input diversity, measured by entropy, and cross-input distinguishability, measured by mutual information (MI). They found MI is a far stronger predictor of final task success. They trace the root cause to a low signal-to-noise ratio (SNR) in training rewards, where weak task-specific gradients are overpowered by regularization, erasing nuanced reasoning.
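The distinction between the two metrics can be illustrated with a toy diagnostic. The sketch below (not the paper's actual estimator; the function names and the discrete "template id" abstraction are assumptions for illustration) computes entropy over an agent's reasoning templates and the mutual information between inputs and templates. A collapsed agent can look diverse under entropy while its MI with the input is zero:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (nats) of a list of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_information(inputs, templates):
    """Empirical I(X; T) between input ids and reasoning-template ids,
    estimated from joint counts over paired observations."""
    n = len(inputs)
    joint = Counter(zip(inputs, templates))
    px = Counter(inputs)
    pt = Counter(templates)
    mi = 0.0
    for (x, t), c in joint.items():
        p_xt = c / n
        # p(x,t) * log( p(x,t) / (p(x) * p(t)) )
        mi += p_xt * math.log(p_xt * n * n / (px[x] * pt[t]))
    return mi

# A "collapsed" agent emits a varied mix of templates that ignores the
# input (nonzero entropy, zero MI); a healthy agent's template choice
# tracks the input (nonzero MI).
inputs    = [0, 0, 1, 1, 2, 2]
collapsed = [0, 1, 0, 1, 0, 1]   # same template mix for every input
healthy   = [0, 0, 1, 1, 2, 2]  # template depends on the input
print(entropy(collapsed), mutual_information(inputs, collapsed))  # 0.693..., 0.0
print(entropy(healthy), mutual_information(inputs, healthy))      # 1.098..., 1.098...
```

This is the paper's core observation in miniature: entropy alone cannot separate the two agents, while MI cleanly does.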
To combat template collapse, the team proposes a practical solution called SNR-Aware Filtering. This method dynamically selects training prompts with high reward variance in each training iteration, ensuring the model receives stronger, more informative learning signals. When tested across diverse agent benchmarks—including planning, math reasoning (like GSM8K), web navigation (WebShop), and code execution—this approach consistently improved both the agent's input-dependent reasoning and its ultimate task performance, offering a new, more reliable pathway for building robust AI agents.
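The filtering idea can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: the function name, the `top_fraction` parameter, and the use of plain reward variance as the selection score are all assumptions. The intuition is that prompts an agent always solves, or never solves, yield near-zero advantage and thus weak gradients, so training concentrates on prompts with mixed outcomes:

```python
import statistics

def snr_aware_filter(prompt_rewards, top_fraction=0.5):
    """Keep the prompts whose reward variance across rollouts is highest,
    i.e. those carrying the strongest learning signal.
    `prompt_rewards` maps prompt id -> list of per-rollout rewards."""
    variances = {p: statistics.pvariance(r) for p, r in prompt_rewards.items()}
    k = max(1, int(len(variances) * top_fraction))
    ranked = sorted(variances, key=variances.get, reverse=True)
    return ranked[:k]

rollouts = {
    "p1": [1.0, 0.0, 1.0, 0.0],  # mixed outcomes: informative gradient
    "p2": [1.0, 1.0, 1.0, 1.0],  # always solved: no advantage signal
    "p3": [0.0, 0.0, 0.0, 0.0],  # never solved: no advantage signal
    "p4": [1.0, 1.0, 0.0, 1.0],
}
print(snr_aware_filter(rollouts, top_fraction=0.5))  # -> ['p1', 'p4']
```

In an RL training loop, such a filter would run each iteration on freshly sampled rollouts, so the retained prompt set shifts as the policy improves and previously hard prompts saturate.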
- Identifies 'template collapse,' a new failure mode where AI agents use fixed reasoning templates regardless of input, invisible to entropy metrics.
- Proposes Mutual Information (MI) as a superior diagnostic metric, correlating more strongly with final performance than entropy across diverse tasks.
- Introduces SNR-Aware Filtering, a training fix that selects high-variance prompts to strengthen learning signals, boosting performance in web navigation and code execution.
Why It Matters
This provides a crucial fix for building reliable, truly reasoning AI agents, moving beyond models that just mimic the appearance of thought.