Research & Papers

Polar framework enables scalable RL for AI agents on any harness

New rollout system boosts Qwen3.5-4B by up to 22.6 points on SWE-Bench Verified

Deep Dive

A team of researchers from NVIDIA, UIUC, and other institutions introduced Polar, a new framework that makes reinforcement learning for language agents practical at scale. Polar acts as a rollout layer that sits between any agent harness (like Codex or Claude Code) and a training system. It proxies the harness's LLM API calls, records the exact token-level interactions, and reconstructs those trajectories faithfully for policy gradient updates. Each rollout node handles prewarming, agent execution, trajectory reconstruction, and evaluation in parallel, and exposes asynchronous endpoints that multiple trainers can consume simultaneously. Because the harness is decoupled from the trainer, Polar is agnostic to both the harness and the RL algorithm, which removes the need to port each harness into the training stack by hand.
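
The proxy-and-reconstruct idea described above can be illustrated with a minimal Python sketch. This is not Polar's code or API: `RecordingProxy`, `Call`, `reconstruct`, and the `backend` callable are hypothetical names, and the prefix-merge rule is only one possible reconstruction strategy, assumed here for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Call:
    """One proxied LLM call, stored at token level (hypothetical record type)."""
    prompt_ids: List[int]
    completion_ids: List[int]
    logprobs: List[float]


@dataclass
class RecordingProxy:
    """Sits between a black-box harness and the model, logging every call."""
    # Assumed backend signature: prompt token ids -> (completion ids, logprobs).
    backend: Callable[[List[int]], Tuple[List[int], List[float]]]
    calls: List[Call] = field(default_factory=list)

    def complete(self, prompt_ids: List[int]) -> List[int]:
        completion_ids, logprobs = self.backend(prompt_ids)
        # Store copies so later mutation by the harness cannot alter the log.
        self.calls.append(Call(list(prompt_ids), list(completion_ids), list(logprobs)))
        return completion_ids


def reconstruct(calls: List[Call]) -> Tuple[List[int], List[int]]:
    """Merge recorded calls into one token sequence plus a loss mask.

    Assumes each prompt extends the previous prompt + completion. The mask is
    1 for model-generated tokens and 0 for tokens the harness supplied.
    """
    tokens: List[int] = []
    mask: List[int] = []
    for call in calls:
        if call.prompt_ids[:len(tokens)] != tokens:
            raise ValueError("prompt does not extend the previous context")
        new_context = call.prompt_ids[len(tokens):]
        tokens = tokens + new_context + call.completion_ids
        mask = mask + [0] * len(new_context) + [1] * len(call.completion_ids)
    return tokens, mask


if __name__ == "__main__":
    # Toy backend: always emits one token equal to the last prompt token + 1.
    proxy = RecordingProxy(backend=lambda ids: ([ids[-1] + 1], [-0.1]))
    first = proxy.complete([1, 2])
    proxy.complete([1, 2] + first + [9])  # harness appends a tool result
    print(reconstruct(proxy.calls))
```

The loss mask is what lets a policy gradient update touch only the tokens the model actually produced, while tool outputs and prompts inserted by the harness stay as context.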

In validation experiments on software-engineering tasks, Polar paired with simple GRPO improved the Qwen3.5-4B model on the SWE-Bench Verified benchmark across four popular coding harnesses, though the size of the gain varied widely: +22.6 points with Codex, +6.2 with Pi, +4.8 with Claude Code, and +0.6 with Qwen Code. The framework also supports offline data generation and includes customizable trajectory reconstruction strategies. Polar builds on the team's prior work, ProRL Agent, and has been registered as one of the NeMo Gym environments, making it accessible to the broader open-source RL community.
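
For readers unfamiliar with GRPO, its core step is to score each rollout relative to the other rollouts sampled for the same task. The sketch below shows that normalization in its commonly described form; the function name, the `eps` value, and the binary rewards are illustrative assumptions, not details taken from the Polar experiments.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of rollouts for the same task.

    Each advantage is (reward - group mean) / (group std + eps), so rollouts
    that beat their siblings get a positive signal and the rest a negative one.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical group of four rollouts: 1.0 = tests passed, 0.0 = failed.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

When every rollout in a group receives the same reward, all advantages are zero and the group contributes no gradient, which is one reason a varied set of outcomes per task matters for this style of training.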

Key Points
  • Polar treats any agent harness as a black box, proxying LLM API calls and recording token-level data for RL training
  • Using GRPO, Polar improved Qwen3.5-4B by 22.6 points on SWE-Bench Verified with the Codex harness
  • Polar is agnostic to training infrastructure and RL algorithms, and is registered as a NeMo Gym environment

Why It Matters

Scalable RL for AI agents becomes easier when the harness is treated as a black box: teams no longer need a custom port of each harness for every new tool.