Adaptive Batch Scaling unlocks large-batch training for RL
A new metric, Behavioral Divergence, lets RL training adjust its batch size dynamically as the policy stabilizes.
Conventional wisdom holds that large-batch training is fundamentally incompatible with reinforcement learning (RL) due to non-stationary data distributions. Jongchan Park challenges this view by observing that non-stationarity evolves over training: early stages require small batches for plasticity, while late stages approach quasi-stationarity where large batches enable precise convergence. This insight drives the proposed Adaptive Batch Scaling (ABS), which dynamically adjusts effective batch size according to policy stability.
ABS introduces Behavioral Divergence, a metric that quantifies policy non-stationarity by measuring action-level shifts between consecutive updates, and scales the effective batch size inversely to this metric. Integrated with the Parallelised Q-Network (PQN) algorithm and tested on the Arcade Learning Environment (ALE) benchmark, ABS reconciles early-stage plasticity with stable late-stage convergence. Strikingly, combining larger networks with larger batch sizes achieves the best performance, a scaling behavior previously considered unattainable in RL. This work points to new scaling behavior for RL training, potentially accelerating the development of more capable autonomous agents.
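The summary does not give the exact formula for Behavioral Divergence or the batch-size rule, so the following is only a minimal sketch under stated assumptions. It assumes the divergence is the fraction of probe states whose greedy action changes between two consecutive updates, and that the batch size is the minimum size divided by the divergence, clipped to a fixed range. The function names, the probe-state setup, and the `min_batch`/`max_batch` defaults are hypothetical, not taken from the paper.

```python
import numpy as np


def behavioral_divergence(q_old: np.ndarray, q_new: np.ndarray) -> float:
    """Assumed metric: fraction of states whose greedy action changed.

    q_old, q_new: arrays of shape (num_states, num_actions) holding Q-values
    for the same probe states before and after one update.
    """
    greedy_old = np.argmax(q_old, axis=1)
    greedy_new = np.argmax(q_new, axis=1)
    return float(np.mean(greedy_old != greedy_new))


def adaptive_batch_size(divergence: float,
                        min_batch: int = 32,
                        max_batch: int = 1024) -> int:
    """Assumed rule: batch size inversely proportional to divergence.

    divergence == 1.0 gives min_batch; as divergence approaches 0 the
    result saturates at max_batch.
    """
    # Flooring the divergence at min_batch / max_batch caps the result
    # at max_batch and avoids division by zero.
    floor = min_batch / max_batch
    d = min(max(divergence, floor), 1.0)
    return int(round(min_batch / d))


if __name__ == "__main__":
    # Four probe states, two actions; one greedy action flips after the update.
    q_before = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
    q_after = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
    d = behavioral_divergence(q_before, q_after)
    print(d, adaptive_batch_size(d))
```

In this sketch a volatile policy (divergence near 1) trains on small batches, while a policy whose greedy actions have stopped changing is pushed toward the largest batch, which mirrors the early-plasticity and late-convergence behavior described above.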
- Proposes Behavioral Divergence, a metric measuring action-level policy shifts between consecutive updates to quantify non-stationarity.
- Dynamically scales batch size inversely to policy volatility, allowing small batches early for plasticity and large batches late for convergence.
- Achieves the best performance with large networks and large batches on the ALE benchmark, challenging long-held RL assumptions.
Why It Matters
Makes large-batch RL training practical, which could speed up convergence and allow larger networks and batch sizes to translate into more capable autonomous systems.