Prune as You Generate: Online Rollout Pruning for Faster and Better RLVR
A new 'prune as you generate' technique cuts computational waste during RLVR training, delivering up to a 1.7x speedup and, with test-time scaling, accuracy gains of up to 8.33 points.
A research team led by Haobo Xu has developed a novel method called arrol (Accelerating RLVR via online Rollout Pruning) that addresses the computational inefficiency of training large language models for complex reasoning tasks. Current Reinforcement Learning with Verifiable Rewards (RLVR) methods like GRPO and DAPO sample many reasoning 'rollouts' for each prompt, creating substantial computational overhead. More critically, many rollout groups become nearly all correct or all wrong early in generation; because these methods compute advantages from the reward variance within each group, such low-variance groups provide almost no learning signal.
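To see why low variance matters, here is a minimal sketch of how GRPO-style group-relative advantages are typically computed; the function name and shapes are illustrative, not taken from the paper.

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """GRPO-style advantages for one group of rollouts.

    rewards: shape (G,), one verifiable reward (e.g. 0/1) per rollout
    sampled for the same prompt.
    """
    mean, std = rewards.mean(), rewards.std()
    # Each rollout's advantage is its reward normalized against the group.
    return (rewards - mean) / (std + eps)

# A group where every rollout succeeds (or every one fails) has zero reward
# variance, so all advantages are ~0 and the group contributes no gradient.
print(group_relative_advantages(np.array([1.0, 1.0, 1.0, 1.0])))  # ~[0, 0, 0, 0]
print(group_relative_advantages(np.array([1.0, 0.0, 1.0, 0.0])))  # informative
```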
arrol introduces an intelligent pruning system that operates during the generation process itself. The method trains a lightweight quality predictor on-the-fly to estimate the success probability of partial reasoning chains, allowing it to prune unpromising rollouts early while steering surviving ones toward more balanced correctness. This pruning happens inside the inference engine, with remaining rollouts re-batched for efficient log-probability computation and policy updates.
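The sketch below illustrates the prune-as-you-generate idea under stated assumptions: the `decode_chunk` and `quality_head` interfaces, the group structure, and the keep-the-uncertain pruning rule are all hypothetical stand-ins for illustration, not the paper's exact algorithm.

```python
from dataclasses import dataclass, field

@dataclass
class Rollout:
    prompt_id: int
    tokens: list = field(default_factory=list)
    finished: bool = False

def generate_with_pruning(decode_chunk, quality_head, num_prompts,
                          group_size=8, max_chunks=16, keep=4):
    """Hypothetical prune-as-you-generate loop (illustrative interface only).

    decode_chunk(rollout): extends one rollout by a chunk of tokens in place.
    quality_head(rollout) -> float: predicted success probability of the
    partial reasoning chain, from the lightweight on-the-fly predictor.
    """
    groups = {p: [Rollout(p) for _ in range(group_size)]
              for p in range(num_prompts)}
    for _ in range(max_chunks):
        active = [r for g in groups.values() for r in g if not r.finished]
        if not active:
            break
        for r in active:  # extend every surviving rollout inside the engine
            decode_chunk(r)
        # Prune within each group: partial chains whose predicted success is
        # already near 0 or 1 add little reward variance, so keep the most
        # uncertain ones, steering the group toward mixed correctness.
        for p, g in groups.items():
            unfinished = [r for r in g if not r.finished]
            unfinished.sort(key=lambda r: abs(quality_head(r) - 0.5))
            dropped = set(map(id, unfinished[keep:]))
            groups[p] = [r for r in g if id(r) not in dropped]
    # Survivors are re-batched for log-prob computation and the policy update.
    return [r for g in groups.values() for r in g]
```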
Across experiments with GRPO and DAPO on Qwen-3 and LLaMA-3.2 models ranging from 1B to 8B parameters, arrol delivered consistent improvements: average accuracy rose by +2.30 to +2.99 percentage points while training sped up by as much as 1.7x. Most strikingly, the learned quality head also enables test-time scaling that adds up to +8.33 points of average accuracy, effectively converting computational efficiency into better final performance.
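The article doesn't spell out the test-time procedure, but a natural reading is best-of-N selection: sample several candidate chains and return the one the quality head scores highest. A minimal sketch under that assumption, with hypothetical `generate` and `quality_head` callables:

```python
def best_of_n(generate, quality_head, prompt, n=8):
    """Hypothetical test-time scaling with the learned quality head.

    generate(prompt) -> str: samples one full reasoning chain.
    quality_head(text) -> float: predicted probability the chain is correct.
    Sampling more candidates (larger n) trades extra compute for accuracy.
    """
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=quality_head)
```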
The system design represents a practical advance in making RL-based reasoning training more accessible. By integrating pruning directly into the inference engine and maintaining efficient batching, arrol reduces the barrier to training more capable reasoning models without requiring massive computational resources. The code has been made publicly available, potentially accelerating research in this critical area of AI development.
- Achieves up to 1.7x training speedup for RLVR methods like GRPO and DAPO
- Improves average accuracy by +2.30 to +2.99 points on Qwen-3 and LLaMA-3.2 models
- Enables test-time scaling via the learned quality predictor, adding up to +8.33 points of average accuracy
Why It Matters
Makes training sophisticated reasoning AI more efficient and effective, potentially lowering compute costs for developing advanced LLMs.