Arena: Efficiently Training Large Models via Dynamic Scheduling and Adaptive Parallelism Co-Design
New system from academic researchers boosts cluster throughput by up to 1.6x by co-designing scheduling and parallelism.
A team of researchers from institutions including Shanghai Jiao Tong University and the National University of Singapore has introduced Arena, a novel system designed to tackle the significant inefficiencies in training large-scale AI models like GPT-4 or Llama 3 in shared GPU clusters. The core problem is a mismatch: current cluster schedulers are built for jobs with static parallelism (SP), but modern training frameworks increasingly use adaptive parallelism (AP), which dynamically changes how computation is split across GPUs during a run. This mismatch leads to poor resource utilization, longer job queues, and wasted GPU hours.
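To make the SP-vs-AP mismatch concrete, here is a minimal Python sketch. It is purely illustrative and assumes a toy planning heuristic; the class and function names (`Plan`, `plan_for`, `StaticJob`, `AdaptiveJob`) are hypothetical and not Arena's API.

```python
from dataclasses import dataclass

@dataclass
class Plan:
    data_parallel: int
    pipeline_stages: int

def plan_for(num_gpus: int) -> Plan:
    """Pick a parallelism layout for the given GPU count (toy heuristic)."""
    stages = 2 if num_gpus >= 4 else 1
    return Plan(data_parallel=num_gpus // stages, pipeline_stages=stages)

class StaticJob:
    """SP: the layout is fixed at submission; a resized allocation is ignored."""
    def __init__(self, num_gpus: int):
        self.plan = plan_for(num_gpus)
    def on_reallocation(self, num_gpus: int) -> Plan:
        return self.plan  # keeps the stale plan

class AdaptiveJob:
    """AP: the layout is recomputed whenever the scheduler resizes the job."""
    def __init__(self, num_gpus: int):
        self.plan = plan_for(num_gpus)
    def on_reallocation(self, num_gpus: int) -> Plan:
        self.plan = plan_for(num_gpus)  # re-split work across the new GPUs
        return self.plan

# Scheduler shrinks both jobs from 8 GPUs to 4:
print(StaticJob(8).on_reallocation(4))    # Plan(data_parallel=4, pipeline_stages=2)
print(AdaptiveJob(8).on_reallocation(4))  # Plan(data_parallel=2, pipeline_stages=2)
```

A scheduler built for SP jobs has no way to exploit the adaptive job's flexibility, which is exactly the gap Arena targets.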
Arena's key innovation is a co-design that tightly couples the scheduling layer with the execution layer. It introduces a low-cost, disaggregated profiling method and AP-tailored performance estimation to predict how a job will perform under different resource configurations without expensive trial runs. These components are unified through a 'grid abstraction' that shards the complex joint optimization space of scheduling and parallelism decisions. At runtime, Arena dynamically schedules profiled jobs across the cluster's elasticity (scaling resources up/down) and heterogeneity (mixing different GPU types) dimensions, while executing them with an efficient AP strategy that uses a pruned search space for faster decisions.
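The grid idea can be sketched as a search over cells that pair a resource configuration (GPU type and count) with a parallelism plan, scored by a profiled cost model rather than trial runs. Everything below is an illustrative toy model with made-up numbers; `PROFILE`, `estimate`, and `search` are assumptions, not the paper's implementation.

```python
import itertools

# Disaggregated profile: per-GPU-type throughput (samples/s per GPU) and a
# communication-overhead factor, measured once on a small number of GPUs.
PROFILE = {"A100": {"tput": 100.0, "comm": 0.05},
           "V100": {"tput": 40.0, "comm": 0.08}}

def estimate(gpu_type: str, n_gpus: int, dp: int) -> float:
    """Estimated throughput for one grid cell (toy model): linear compute
    scaling discounted by overhead that grows with the data-parallel degree."""
    p = PROFILE[gpu_type]
    return p["tput"] * n_gpus * (1.0 - p["comm"] * (dp - 1))

def search(free_gpus: dict) -> tuple:
    """Enumerate grid cells, pruning plans where dp doesn't divide n_gpus
    or the cell exceeds the free capacity of that GPU type."""
    best, best_tput = None, float("-inf")
    for gtype, free in free_gpus.items():
        for n, dp in itertools.product((2, 4, 8), (1, 2, 4, 8)):
            if n > free or n % dp:      # prune infeasible cells early
                continue
            t = estimate(gtype, n, dp)
            if t > best_tput:
                best, best_tput = (gtype, n, dp), t
    return best, best_tput

best, tput = search({"A100": 4, "V100": 8})
print(best, tput)  # ('A100', 4, 1) 400.0
```

The pruning step is the point: feasibility checks discard most of the joint scheduling-plus-parallelism space before any estimate is computed, which is how decisions stay cheap at runtime.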
Evaluated on heterogeneous testbeds and production workloads, Arena delivered substantial performance gains. It reduced the average job completion time (JCT) by up to 49.3% and improved overall cluster throughput—the number of jobs completed per unit time—by up to 1.6x compared to state-of-the-art schedulers like Pollux and Gandiva. This represents a major step towards more cost-effective and scalable infrastructure for the next generation of massive AI models, directly addressing a critical bottleneck in the industry.
- Co-designs dynamic scheduling and adaptive parallelism (AP) to fix the mismatch in current GPU clusters, using a novel grid abstraction.
- Achieves up to a 49.3% reduction in job completion time and up to a 1.6x improvement in cluster throughput in evaluations.
- Employs low-cost disaggregated profiling and AP-tailored estimation to make efficient scheduling decisions without costly trial runs.
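The disaggregated-profiling idea in the last bullet can be sketched as: measure compute and communication separately at small scale, then compose them analytically to predict step time at any scale, avoiding a full trial run. The cost model and all constants below are illustrative assumptions, not Arena's actual profiler.

```python
def profile_compute(batch_per_gpu: int) -> float:
    """Stand-in measurement: forward+backward time on ONE GPU (seconds)."""
    return 0.002 * batch_per_gpu

def profile_comm(msg_mb: float) -> float:
    """Stand-in measurement: all-reduce time for one message (seconds),
    as a latency term plus a bandwidth term."""
    return 0.001 + msg_mb / 1000.0

def predict_step_time(n_gpus: int, batch_per_gpu: int, grad_mb: float) -> float:
    """Compose the two small-scale profiles into a full-job estimate.
    Ring all-reduce moves ~2*(n-1)/n of the gradient size per step."""
    compute = profile_compute(batch_per_gpu)
    comm = profile_comm(grad_mb * 2 * (n_gpus - 1) / n_gpus)
    return compute + comm

# Predict a 64-GPU configuration the profiler never actually ran:
step = predict_step_time(n_gpus=64, batch_per_gpu=32, grad_mb=500)
print(round(step, 4))
```

Because both measurements come from a handful of GPUs, the scheduler can score many candidate configurations at negligible cost, which is what makes trial-run-free scheduling decisions feasible.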
Why It Matters
This directly lowers the cost and time for companies and labs training frontier AI models, making research and development more efficient.