Research & Papers

ICRL: New RL framework lets LLMs internalize self-critique, boosting reasoning by 7 points

A joint solver-critic training method that eliminates the need for external critique while improving performance.

Deep Dive

ICRL (Internalizing Self-Critique with Reinforcement Learning) addresses a key limitation of current LLM agents: a model that succeeds with the help of critique often fails again on the same query once that critique is removed. The framework jointly trains a solver and a critic from a shared backbone. The critic is rewarded based on the solver's subsequent performance gain, which incentivizes actionable feedback. To handle the distribution shift between critique-conditioned and critique-free behavior, ICRL introduces a distribution-calibration re-weighting ratio that selectively transfers critique-guided improvements compatible with the solver's own prompt distribution. A role-wise group advantage estimation stabilizes joint optimization across the two roles.
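The critic reward and the role-wise advantage step can be illustrated with a small sketch. This is not the paper's implementation: it assumes the critic reward is simply the solver's score after critique minus its score before, and that "role-wise" means GRPO-style mean/std normalization applied separately to each role's group of rollouts. All function names and the example scores are hypothetical.

```python
from statistics import mean, pstdev


def critic_rewards(base_scores, revised_scores):
    """Reward each critique by the solver's gain (assumed form: after minus before)."""
    return [after - before for before, after in zip(base_scores, revised_scores)]


def group_advantages(rewards, eps=1e-6):
    """GRPO-style group normalization: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def role_wise_advantages(solver_rewards, critic_rewards_, eps=1e-6):
    """Normalize each role within its own group so that neither role's
    reward scale dominates the shared backbone's update (an assumption)."""
    return {
        "solver": group_advantages(solver_rewards, eps),
        "critic": group_advantages(critic_rewards_, eps),
    }


# Hypothetical group of four rollouts for one query.
base = [0.0, 0.0, 1.0, 0.0]     # solver scores without critique
revised = [1.0, 0.0, 1.0, 1.0]  # solver scores after reading each critique

c_rewards = critic_rewards(base, revised)
advantages = role_wise_advantages(base, c_rewards)
```

In this toy group, the first and fourth critiques turn a failure into a success and so receive positive critic advantages, while the two critiques that change nothing receive negative ones. Normalizing per role keeps the two reward scales from mixing, which is one plausible reading of how the estimator stabilizes joint training.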

Evaluated on diverse benchmarks for agentic and mathematical reasoning tasks using Qwen3-4B and Qwen3-8B backbones, ICRL shows consistent improvements over GRPO: average gains of 6.4 points on agentic tasks and 7.0 points on math reasoning. Notably, the learned 8B critic is comparable to 32B critics while using substantially fewer tokens. This suggests that a smaller critic trained with ICRL can provide feedback on par with much larger critics, reducing computational overhead. The framework represents a step toward self-improving AI systems that don't depend on external feedback loops.

Key Points
  • ICRL jointly trains solver and critic from shared backbone, rewarding critic for improving solver's performance
  • Achieves 6.4-point gain on agentic tasks and 7.0-point gain on math reasoning over GRPO baseline
  • Learned 8B critic is comparable to 32B critics while using substantially fewer tokens

Why It Matters

Moves LLM agents toward self-improvement without external critique, which could reduce inference costs and enable more autonomous learning.