PlexRL slashes GPU costs for RL reasoning training by up to 37.58%
New cluster-level orchestration cuts wasted idle time in RLVR training jobs
The paper 'PlexRL: Cluster-Level Orchestration of Serviceized LLM Execution for RLVR' from researchers at multiple institutions tackles a fundamental inefficiency in reinforcement learning with verifiable rewards (RLVR) training for large language models. RLVR training suffers from long-tailed rollouts, tool-induced stalls, and asymmetric resource requirements between rollout and training phases, creating idle gaps that local optimizations alone cannot eliminate. The authors identify that while these gaps are unavoidable within individual jobs, they are largely anti-correlated across jobs, making them exploitable at the cluster level.
PlexRL is a cluster-level runtime that centrally manages model placement, state transitions, and function-level scheduling under strict affinity constraints. It time-slices LLM execution across multiple RLVR jobs, filling otherwise idle periods without expensive model migration. The authors' implementation and evaluation show a reduction of up to 37.58% in user GPU hour costs, improved effective cluster capacity, and minimal per-job overhead. This work points to a new direction for scaling RLVR training efficiently on shared infrastructure.
- PlexRL reduces GPU hour costs by up to 37.58% by multiplexing LLM services across RLVR training jobs
- System exploits anti-correlated idle gaps between jobs via cluster-level time-slicing without model migration
- Preserves algorithmic flexibility and introduces minimal per-job overhead while improving effective cluster capacity
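The idea of filling anti-correlated idle gaps can be illustrated with a toy model. The sketch below is not PlexRL's scheduler: the boolean timelines, the first-fit packing rule, and all function names are assumptions made for illustration only. Each job is a list of time slots marked busy or idle, and a job may share a GPU pool with others only if their busy slots never overlap. Each job stays on a single pool for its whole lifetime, which loosely mirrors the no-migration constraint described above.

```python
from typing import List

# Hypothetical representation: True means the job needs the GPU pool in that slot.
Timeline = List[bool]


def dedicated_gpu_slots(jobs: List[Timeline]) -> int:
    """Baseline: every job reserves its own pool for the full horizon, busy or idle."""
    return sum(len(job) for job in jobs)


def pack_jobs(jobs: List[Timeline]) -> List[List[int]]:
    """First-fit packing of jobs onto shared pools.

    A job joins an existing pool only if none of its busy slots collide with
    slots already claimed on that pool. Jobs never move between pools.
    Returns, for each pool, the indices of the jobs placed on it.
    """
    pools_busy: List[Timeline] = []   # merged busy slots per pool
    assignment: List[List[int]] = []  # job indices per pool
    for idx, job in enumerate(jobs):
        for p, busy in enumerate(pools_busy):
            if not any(a and b for a, b in zip(busy, job)):
                pools_busy[p] = [a or b for a, b in zip(busy, job)]
                assignment[p].append(idx)
                break
        else:
            # No existing pool had compatible gaps, so open a new one.
            pools_busy.append(list(job))
            assignment.append([idx])
    return assignment


def multiplexed_gpu_slots(jobs: List[Timeline]) -> int:
    """Cost when pools are shared: number of pools times the common horizon."""
    horizon = len(jobs[0])  # assumes all timelines have equal length
    return len(pack_jobs(jobs)) * horizon


# Toy workload: jobs A and B have perfectly anti-correlated busy phases,
# while job C overlaps with both and therefore needs its own pool.
T, F = True, False
jobs = [
    [T, T, F, F, T, T, F, F],  # job A
    [F, F, T, T, F, F, T, T],  # job B
    [T, F, T, F, T, F, T, F],  # job C
]

baseline = dedicated_gpu_slots(jobs)
shared = multiplexed_gpu_slots(jobs)
print(f"dedicated: {baseline} slots, multiplexed: {shared} slots")
print(f"toy saving: {1 - shared / baseline:.1%}")
```

In this toy workload, jobs A and B fit on one pool and job C takes a second, so the reserved capacity drops from three pools to two. The resulting figure is a property of the made-up timelines and has no relation to the 37.58% reported in the paper, whose system also has to handle state transitions and function-level scheduling that this sketch ignores.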
Why It Matters
Makes expensive RLVR reasoning training more efficient, reducing cost barriers for developing advanced LLM reasoning capabilities.