Scalable Pretraining of Large Mixture of Experts Language Models on the Aurora Supercomputer
A team trained a 220-billion-parameter Mixture of Experts model on 12,288 Intel GPU tiles of the Aurora supercomputer, achieving approximately 90% scaling efficiency.
A team of Intel researchers has demonstrated the scalable pretraining of large Mixture of Experts (MoE) language models on the Aurora supercomputer, a major milestone for Intel's AI hardware. Using their in-house training framework, 'Optimus,' they trained a series of models, culminating in Mula-220B-A10B, a 220-billion-parameter model that activates 10 billion parameters per forward pass. Training was conducted on the Aurora system, which comprises 127,488 Intel Ponte Vecchio (PVC) GPU tiles, scaling the run from 384 to 12,288 tiles while maintaining a scaling efficiency of approximately 90% at the largest scale.
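To make the "10 billion active of 220 billion total" figure concrete, here is a minimal sketch of top-k expert routing, the standard mechanism by which MoE models run only a small fraction of their weights per token. It is written in PyTorch with illustrative sizes; the layer dimensions, expert count, and k are assumptions, and this is not the Optimus implementation or the paper's routing configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    """Minimal top-k MoE layer. All sizes are hypothetical, chosen
    for illustration rather than taken from Mula-220B-A10B."""

    def __init__(self, d_model=512, d_ff=2048, n_experts=16, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: [tokens, d_model]
        # Route each token to its top-k experts; only those experts run,
        # so roughly k/n_experts of the expert weights are active per token.
        logits = self.router(x)                            # [tokens, n_experts]
        weights, idx = torch.topk(logits, self.k, dim=-1)  # [tokens, k]
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_pos, slot = (idx == e).nonzero(as_tuple=True)
            if token_pos.numel() == 0:
                continue  # this expert received no tokens in the batch
            out[token_pos] += weights[token_pos, slot].unsqueeze(-1) * expert(x[token_pos])
        return out

layer = TopKMoELayer()
y = layer(torch.randn(8, 512))  # 8 tokens, each touching 2 of 16 experts
```

At Mula-220B-A10B's ratio, roughly 1 in 22 parameters participates in any given forward pass, which is what keeps per-token compute far below that of a dense 220B model.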
Key to their success was a suite of performance optimizations within the Optimus library. The team developed custom GPU kernels for expert computation and a novel 'EP-Aware sharded optimizer,' which together delivered training speedups of up to 1.71x. They pretrained their models on the OLMoE-mix-0924 dataset, with the largest models trained on 100 billion tokens. The work also included reliability and fault-tolerance features to keep training stable and continuous at this scale, demonstrating the practical readiness of the hardware-software stack for frontier AI model development.
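The summary above does not detail how the EP-Aware sharded optimizer works, but the usual motivation for such a design can be sketched: under expert parallelism (EP), expert weights already live on only a subset of ranks, so their optimizer states should be sharded across the smaller group of ranks that replicate each expert shard, not across the whole world as with dense weights. The group layout below is an assumption for illustration only, not the Optimus scheme.

```python
# Conceptual sketch of expert-parallel-aware optimizer-state sharding.
# WORLD and EP are hypothetical; the actual Optimus group layout is not
# described in the source text.

WORLD = 16   # total ranks (assumed)
EP = 4       # expert-parallel group size (assumed)

def dense_shard_group(rank: int) -> list[int]:
    # Dense (non-expert) weights are replicated on every rank, so their
    # optimizer states can be sharded across the full world group.
    return list(range(WORLD))

def expert_shard_group(rank: int) -> list[int]:
    # Each rank holds only its EP slice of the expert weights. That slice
    # is replicated on the WORLD // EP ranks sharing the same EP position,
    # so its optimizer states are sharded over that smaller group only.
    return [r for r in range(WORLD) if r % EP == rank % EP]

for rank in (0, 1):
    print(rank, dense_shard_group(rank), expert_shard_group(rank))
```

Sharding expert states only within their replica group avoids storing, updating, or communicating optimizer state on ranks that never hold those weights, which is the kind of saving that plausibly contributes to the reported speedups.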
This research is a significant proof point for Intel's AI accelerator roadmap. By efficiently scaling a state-of-the-art model architecture like MoE on their flagship supercomputer, the team has validated the performance and scalability of the Intel PVC GPU architecture for the most demanding AI workloads. The work provides a blueprint and performance benchmarks for future large-scale training efforts on Intel-based exascale systems.
- Trained a 220B-parameter MoE model (Mula-220B-A10B) on 12,288 Intel PVC GPU tiles with ~90% scaling efficiency (see the worked example after this list).
- Used the custom 'Optimus' training library with novel kernels and an EP-Aware sharded optimizer for speedups of up to 1.71x.
- Demonstrated scalable pretraining on the Aurora supercomputer, validating Intel hardware for frontier AI workloads.
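For readers who want the efficiency figure unpacked: scaling efficiency here means measured throughput divided by the ideal linear throughput at the larger tile count. Below is a minimal worked example; the throughput value is hypothetical, chosen to reproduce the reported ~90%, and only the 384 and 12,288 tile counts come from the text.

```python
# Scaling-efficiency arithmetic; throughput numbers are hypothetical.
base_tiles, base_tput = 384, 1.0        # normalized throughput at baseline
large_tiles, large_tput = 12_288, 28.8  # assumed measured throughput

ideal = base_tput * (large_tiles / base_tiles)  # perfect linear scaling: 32.0
efficiency = large_tput / ideal                 # 28.8 / 32.0 = 0.90
print(f"{efficiency:.0%}")                      # -> 90%
```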
Why It Matters
Proves Intel's supercomputing hardware can efficiently train frontier AI models, creating more competition in the AI accelerator market.