COPUS: Co-adaptive Parallelism and Batch Size Selection in Large Language Model Training
Joint optimization of batch size and parallelism cuts training time by up to 11.1%.
Training large language models involves two interdependent decisions: the global batch size (which affects statistical efficiency) and the 3D parallelism strategy (which affects hardware throughput). Traditional approaches treat these independently—optimization work adapts the batch size while keeping parallelism fixed, and systems work selects the fastest parallelism for a given batch size. This decoupling leads to suboptimal configurations because the throughput-optimal parallelism can shift as batch size changes. Researchers from MBZUAI, CMU, and other institutions present COPUS, a system that co-adapts both decisions dynamically throughout training.
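To make the coupling concrete, here is a minimal sketch (not from the paper) of a joint search space over global batch size and 3D parallelism. The candidate degrees, divisibility constraints, and the `Candidate` type are illustrative assumptions; the point is that batch size and layout are scored as pairs rather than fixed one at a time.

```python
from dataclasses import dataclass
from itertools import product
from typing import Iterator

@dataclass(frozen=True)
class Candidate:
    global_batch_size: int
    dp: int  # data-parallel degree
    tp: int  # tensor-parallel degree
    pp: int  # pipeline-parallel degree

def joint_candidates(num_gpus: int, batch_sizes: list) -> Iterator[Candidate]:
    """Enumerate (batch size, 3D layout) pairs that use all GPUs.

    Scoring these pairs jointly is what distinguishes co-adaptation from
    fixing one dimension and optimizing the other: the best (dp, tp, pp)
    layout can change when the global batch size changes.
    """
    for gbs, tp, pp in product(batch_sizes, (1, 2, 4, 8), (1, 2, 4)):
        if num_gpus % (tp * pp):
            continue  # layout must tile the cluster exactly
        dp = num_gpus // (tp * pp)
        if gbs % dp == 0:  # global batch must shard evenly across data-parallel replicas
            yield Candidate(gbs, dp, tp, pp)
```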
COPUS is guided by a metric it calls Goodput, defined as the product of throughput and statistical efficiency. This metric directly measures useful convergence progress per unit of wall-clock time, capturing both hardware and statistical effects. The system combines online gradient noise scale estimation under 3D parallelism with throughput-aware evaluation of candidate configurations, enabling efficient reconfiguration without costly trial runs. Evaluated on LLM pre-training workloads across 1–4 nodes of 8xH100 and 8xMI210 GPUs with model sizes from 3B to 32B parameters, COPUS achieved average time-to-convergence speedups of 3.9–8.0% over the fastest baseline, with peak gains of up to 11.1% even after accounting for system overheads.
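A minimal sketch of how such a Goodput score could be computed, assuming the standard two-point gradient-noise-scale estimator from large-batch training analyses and a simple 1 / (1 + B / B_noise) efficiency model; the function names, the exact efficiency form, and the `estimate_throughput` callback are assumptions here, not the paper's implementation.

```python
def gradient_noise_scale(gnorm_sq_small: float, gnorm_sq_big: float,
                         b_small: int, b_big: int) -> float:
    """Estimate the gradient noise scale B_noise from squared gradient norms
    measured at two effective batch sizes (e.g., per-rank vs. all-reduced
    gradients under data parallelism). A real system would smooth these
    noisy measurements online."""
    g_true_sq = (b_big * gnorm_sq_big - b_small * gnorm_sq_small) / (b_big - b_small)
    s_var = (gnorm_sq_small - gnorm_sq_big) / (1.0 / b_small - 1.0 / b_big)
    return s_var / max(g_true_sq, 1e-12)

def statistical_efficiency(global_batch_size: int, noise_scale: float) -> float:
    """Useful progress per example relative to small-batch training,
    under the simple 1 / (1 + B / B_noise) model (an assumption here)."""
    return 1.0 / (1.0 + global_batch_size / noise_scale)

def goodput(examples_per_second: float, global_batch_size: int, noise_scale: float) -> float:
    """Goodput = hardware throughput x statistical efficiency."""
    return examples_per_second * statistical_efficiency(global_batch_size, noise_scale)

# Co-adaptive selection: score each (batch size, layout) pair from the
# enumeration sketched above with a throughput estimate for that layout
# (hypothetical estimate_throughput), then take the argmax.
# best = max(joint_candidates(32, [256, 512, 1024]),
#            key=lambda c: goodput(estimate_throughput(c), c.global_batch_size, b_noise))
```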
- COPUS jointly optimizes global batch size and 3D parallelism, unlike prior independent approaches.
- Uses Goodput (throughput × statistical efficiency) as a unified optimization metric.
- Achieves 3.9–8.0% average speedups on 3B–32B models, with peak gains of 11.1% on H100 GPUs.
Why It Matters
By co-adapting batch size and parallelism, COPUS cuts LLM pre-training time by up to 11.1%, a direct reduction in compute cost for large-scale training runs where GPU-hours dominate the budget.