UltraEP achieves 94.3% ideal throughput for MoE models on rack-scale nodes

arXiv cs.DC June 04, 2026

⚡ New real-time load balancer boosts MoE throughput by 1.49×, validated in production training on 2560 GPUs

Deep Dive

Large-scale expert parallelism (EP) is critical for training and serving frontier MoE models, but it suffers from device-level load imbalance that causes compute stragglers, token all-to-all bottlenecks, and activation-memory spikes. Existing balancers redistribute experts periodically based on historical load, which fails under non-stationary production loads. UltraEP introduces the first exact-load, real-time balancer that rebalances every microbatch and layer on critical paths, leveraging extended scale-up connectivity of rack-scale nodes (RSNs). It uses quota-driven planning to react instantly to post-gating load and executes expert-state transfers with RSN-native persistent tile streaming and relay-based fan-out mitigation.
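The idea of planning against a per-rank quota can be illustrated with a small sketch. This is not UltraEP's planner: the function names, the choice of quota (the ceiling of the mean per-rank load), and the greedy pairing are all assumptions made for illustration. It also treats load as interchangeable units and ignores which expert each token belongs to and the cost of moving expert state.

```python
from math import ceil


def plan_quota_transfers(loads):
    """Greedy plan that shifts load from ranks above a quota to ranks below it.

    `loads` holds the per-rank token count observed after gating.
    Returns (quota, moves), where each move is (src_rank, dst_rank, amount).
    Hypothetical helper for illustration only.
    """
    num_ranks = len(loads)
    # Assumption: the quota is the ceiling of the mean per-rank load.
    quota = ceil(sum(loads) / num_ranks)
    surplus = [[r, n - quota] for r, n in enumerate(loads) if n > quota]
    deficit = [[r, quota - n] for r, n in enumerate(loads) if n < quota]

    moves = []
    i = j = 0
    # Pair overloaded ranks with underloaded ones until no surplus remains.
    while i < len(surplus) and j < len(deficit):
        src, extra = surplus[i]
        dst, room = deficit[j]
        amount = min(extra, room)
        moves.append((src, dst, amount))
        surplus[i][1] -= amount
        deficit[j][1] -= amount
        if surplus[i][1] == 0:
            i += 1
        if deficit[j][1] == 0:
            j += 1
    return quota, moves


def apply_moves(loads, moves):
    """Return the per-rank loads after applying the planned moves."""
    out = list(loads)
    for src, dst, amount in moves:
        out[src] -= amount
        out[dst] += amount
    return out


if __name__ == "__main__":
    loads = [400, 100, 50, 50]  # made-up post-gating loads for four ranks
    quota, moves = plan_quota_transfers(loads)
    print(quota, moves, apply_moves(loads, moves))
```

Because the quota is at least the mean load, the total headroom below the quota always covers the total surplus above it, so every rank ends at or below the quota. In a real system each move would correspond to making an expert's weights available on the destination rank, which is where the transfer mechanisms described above come in.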

Averaged across MoE models from 106B to 671B parameters in training and prefill scenarios, UltraEP achieves 94.3% of the force-balanced ideal throughput, a 1.49× improvement over running without balancing. It reduces final inter-rank imbalance from 1.30–4.01 down to 1.01–1.04. The system's scalability and robustness were validated in production MoE training with 2560 GPUs. These results bring large-scale MoE deployments much closer to their balanced ideal, enabling faster training cycles and more reliable inference.
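To make the imbalance figures concrete, the sketch below computes an imbalance ratio for a set of per-rank loads. The summary does not define the metric, so the definition used here (busiest rank's load divided by the mean load) is an assumption, as are the function name and the example numbers.

```python
def imbalance(loads):
    """Ratio of the busiest rank's load to the mean per-rank load.

    Assumed definition for illustration: 1.0 means perfectly balanced.
    """
    mean = sum(loads) / len(loads)
    return max(loads) / mean


if __name__ == "__main__":
    skewed = [400, 100, 50, 50]      # made-up loads, one hot rank
    balanced = [150, 150, 150, 150]  # same total, evenly spread
    print(round(imbalance(skewed), 2))    # 2.67
    print(round(imbalance(balanced), 2))  # 1.0
```

Under this definition, a ratio of 1.04 means the busiest rank carries 4% more than the average, while a ratio of 4.01 means it carries roughly four times the average and the other ranks wait on it.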

Key Points

UltraEP rebalances every microbatch and layer in real-time using quota-driven planning, unlike periodic historical balancers.
Achieves 94.3% of force-balanced ideal throughput, a 1.49× improvement over non-balancing across models from 106B to 671B parameters.
Reduces inter-rank imbalance from 1.30–4.01 to 1.01–1.04, validated on production MoE training with 2560 GPUs.

Why It Matters

Real-time load balancing delivers 1.49× the throughput of unbalanced MoE execution (about 49% more), averaged across training and prefill, reducing costs and enabling bigger frontier models.
