Research & Papers

UltraEP reaches 94.3% of ideal throughput for MoE models on rack-scale nodes

New load balancer boosts MoE throughput by 1.49× and is validated in production training on 2560 GPUs

Deep Dive

Large-scale expert parallelism (EP) is critical for training and serving frontier MoE models, but it suffers from device-level load imbalance that causes compute stragglers, token all-to-all bottlenecks, and activation-memory spikes. Existing balancers redistribute experts periodically based on historical load, which fails under non-stationary production loads. UltraEP is presented as the first exact-load, real-time balancer: it rebalances at every microbatch and layer, on the critical path, by exploiting the extended scale-up connectivity of rack-scale nodes (RSNs). Quota-driven planning lets it react immediately to the post-gating load, and expert-state transfers are executed with RSN-native persistent tile streaming and relay-based fan-out mitigation.
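To make "load imbalance" concrete, here is a minimal Python sketch. It assumes imbalance is measured as the busiest rank's token count divided by the mean across ranks, which is a common convention but not one the summary spells out. The token counts, the `greedy_placement` helper, and the longest-processing-time heuristic are illustrative stand-ins; they are not UltraEP's quota-driven planner.

```python
def rank_loads(tokens_per_expert, placement, num_ranks):
    """Sum post-gating token counts on each rank for a given expert placement."""
    loads = [0] * num_ranks
    for expert, tokens in tokens_per_expert.items():
        loads[placement[expert]] += tokens
    return loads


def imbalance(loads):
    """Assumed metric: max rank load / mean rank load (1.0 is perfectly balanced)."""
    mean = sum(loads) / len(loads)
    if mean == 0:
        raise ValueError("no tokens routed")
    return max(loads) / mean


def greedy_placement(tokens_per_expert, num_ranks):
    """Hypothetical rebalancer: heaviest expert first, onto the lightest rank."""
    loads = [0] * num_ranks
    placement = {}
    by_load = sorted(tokens_per_expert.items(), key=lambda kv: -kv[1])
    for expert, tokens in by_load:
        target = min(range(num_ranks), key=loads.__getitem__)
        placement[expert] = target
        loads[target] += tokens
    return placement


# Made-up post-gating token counts for 8 experts in one microbatch and layer.
tokens = {0: 900, 1: 100, 2: 800, 3: 200, 4: 150, 5: 250, 6: 300, 7: 300}
num_ranks = 4

static = {e: e % num_ranks for e in tokens}  # fixed round-robin placement
before = rank_loads(tokens, static, num_ranks)

replanned = greedy_placement(tokens, num_ranks)  # uses the exact current load
after = rank_loads(tokens, replanned, num_ranks)

print("static   :", before, round(imbalance(before), 3))
print("replanned:", after, round(imbalance(after), 3))
```

In this toy case the imbalance drops from about 1.47 to 1.20. It cannot go lower because whole-expert placement cannot split the single hottest expert, which illustrates why coarse reshuffling alone does not reach the 1.01–1.04 range reported for UltraEP.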

Averaged across MoE models from 106B to 671B parameters in training and prefill scenarios, UltraEP achieves 94.3% of the force-balanced ideal throughput, a 1.49× improvement over running without a balancer. It reduces final inter-rank imbalance from 1.30–4.01 down to 1.01–1.04. The system's scalability and robustness were validated in production MoE training on 2560 GPUs. For large-scale MoE deployments, this means shorter training cycles and more reliable inference.
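A quick back-of-the-envelope reading of those figures, as a Python sketch. It assumes a fixed amount of work, so that wall-clock time scales as the inverse of throughput. The implied baseline fraction divides one reported average by another, so treat it as a rough indication rather than a number from the paper.

```python
def time_reduction(speedup):
    """Fraction of wall-clock time saved at fixed work for a throughput speedup."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


speedup = 1.49            # reported gain over non-balancing
fraction_of_ideal = 0.943  # reported share of force-balanced ideal throughput

saved = time_reduction(speedup)          # about 0.329
gap_to_ideal = 1.0 - fraction_of_ideal   # about 0.057

# Rough only: a ratio of two averages, not a per-model measurement.
baseline_fraction = fraction_of_ideal / speedup  # about 0.633

print(f"time saved at fixed work: {saved:.1%}")
print(f"remaining gap to ideal:   {gap_to_ideal:.1%}")
print(f"implied unbalanced share: {baseline_fraction:.1%}")
```

In other words, 1.49× throughput is 49% more work per unit time, or roughly a one-third cut in time for the same job.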

Key Points
  • UltraEP rebalances every microbatch and layer in real time using quota-driven planning, unlike periodic balancers that rely on historical load.
  • Achieves 94.3% of force-balanced ideal throughput, a 1.49× improvement over non-balancing across models from 106B to 671B parameters.
  • Reduces inter-rank imbalance from 1.30–4.01 to 1.01–1.04, validated on production MoE training with 2560 GPUs.

Why It Matters

Real-time load balancing raises large-scale MoE throughput by 1.49× (49%) over running without a balancer, which cuts compute cost and makes bigger frontier models more practical.