Polar framework enables scalable RL for AI agents on any harness
New rollout system boosts Qwen3.5-4B by up to 22.6 points on SWE-Bench Verified
A team of researchers from NVIDIA, UIUC, and other institutions introduced Polar, a new framework that makes reinforcement learning for language agents practical at scale. Polar acts as a rollout layer that sits between any agent harness (like Codex or Claude Code) and a training system. It proxies LLM API calls, records exact token-level interactions, and reconstructs those trajectories faithfully for policy gradient updates. Each rollout node handles prewarming, agent execution, trajectory reconstruction, and evaluation in parallel, exposing asynchronous endpoints that multiple trainers can consume simultaneously. This decoupled design makes Polar harness-agnostic and RL-algorithm-agnostic, solving a key bottleneck in agentic RL.
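The proxy-and-reconstruct idea can be illustrated with a minimal sketch. This is not Polar's actual code: `RecordingProxy`, `RecordedCall`, `reconstruct_trajectory`, and `fake_backend` are hypothetical names, the backend is a stand-in function rather than a real LLM server, and the prefix-extension strategy shown here is only one possible way to reconstruct a trajectory.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical backend signature (an assumption for this sketch):
# prompt token ids -> (completion token ids, per-token logprobs)
Backend = Callable[[List[int]], Tuple[List[int], List[float]]]


@dataclass
class RecordedCall:
    prompt_ids: List[int]
    completion_ids: List[int]
    logprobs: List[float]


@dataclass
class RecordingProxy:
    """Sits between a black-box harness and the model, logging every call."""
    backend: Backend
    calls: List[RecordedCall] = field(default_factory=list)

    def complete(self, prompt_ids: List[int]) -> List[int]:
        completion_ids, logprobs = self.backend(prompt_ids)
        # Store the exact tokens so nothing has to be re-tokenized later.
        self.calls.append(
            RecordedCall(list(prompt_ids), list(completion_ids), list(logprobs))
        )
        return completion_ids


def reconstruct_trajectory(calls: List[RecordedCall]) -> Tuple[List[int], List[int]]:
    """Merge recorded calls into one token sequence plus a loss mask.

    Mask is 1 for tokens the policy generated and 0 for everything else
    (system prompt, tool output). Assumes each prompt extends the
    previous context; other strategies would be needed otherwise.
    """
    tokens: List[int] = []
    loss_mask: List[int] = []
    for call in calls:
        if call.prompt_ids[: len(tokens)] != tokens:
            raise ValueError("prompt does not extend the recorded context")
        new_context = call.prompt_ids[len(tokens):]
        tokens.extend(new_context)
        loss_mask.extend([0] * len(new_context))
        tokens.extend(call.completion_ids)
        loss_mask.extend([1] * len(call.completion_ids))
    return tokens, loss_mask


def fake_backend(prompt_ids: List[int]) -> Tuple[List[int], List[float]]:
    # Stand-in for a real model: emits one token derived from prompt length.
    return [100 + len(prompt_ids)], [-0.1]


proxy = RecordingProxy(fake_backend)
first = proxy.complete([1, 2])                 # harness sends its first prompt
second = proxy.complete([1, 2] + first + [7])  # 7 plays the role of tool output
tokens, loss_mask = reconstruct_trajectory(proxy.calls)
```

Because the harness only ever sees `complete`, it needs no changes to take part in training, which is the sense in which a proxy of this kind can treat the harness as a black box.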
In validation experiments on software-engineering tasks, Polar paired with simple GRPO improved the Qwen3.5-4B model on the SWE-Bench Verified benchmark across four popular coding harnesses, though the size of the gain varied widely: +22.6 points with Codex, +4.8 with Claude Code, +0.6 with Qwen Code, and +6.2 with Pi. The framework also supports offline data generation and includes customizable trajectory reconstruction strategies. Polar builds on the team's prior work, Prorl Agent, and has been registered as one of the NeMo Gym environments, making it accessible to the broader open-source RL community.
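For readers unfamiliar with GRPO, its core step is to score each rollout relative to the other rollouts sampled for the same task. The sketch below shows that group-relative normalization in its commonly described form. The article does not specify Polar's exact configuration, so the function name, the use of population standard deviation, the `eps` value, and the pass/fail reward scheme are all assumptions.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of rollouts for the same task.

    Each advantage is (reward - group mean) / (group std + eps), so
    rollouts that beat their siblings get a positive training signal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ here
    return [(r - mu) / (sigma + eps) for r in rewards]


# Assumed reward scheme: 1.0 if the patch passes the task's tests, else 0.0.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When every rollout in a group earns the same reward, all advantages come out as zero, so that task contributes no gradient signal for the step.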
- Polar treats any agent harness as a black box, proxying LLM API calls and recording token-level data for RL training
- Using GRPO, Polar improved Qwen3.5-4B by 22.6 points on SWE-Bench Verified with the Codex harness
- Registered as a NeMo Gym environment, Polar is agnostic to training infrastructure and RL algorithms
Why It Matters
Scalable RL for AI agents just got easier: no more custom harness porting for every new tool.