ViBE: New framework cuts MoE time-to-first-token by up to 45% with hardware-aware expert placement
14% higher SLO attainment and up to 45% lower P90 TTFT for Mixture-of-Experts serving.
Distributed inference for Mixture-of-Experts (MoE) models suffers from stragglers caused by the interplay of workload skew (uneven token routing) and hardware variability (performance differences across nominally identical GPUs). Prior work focuses only on balancing token assignments, ignoring the fact that even balanced loads can land on slower GPUs. ViBE solves this by combining per-GPU performance modeling with expert activation profiling to intelligently place high-load experts on faster devices and low-load experts on slower ones.
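The placement idea can be illustrated with a small sketch. This summary does not spell out ViBE's actual algorithm, so the code below is only an assumption-laden illustration: it uses a simple greedy heuristic (heaviest expert first, onto the GPU that would finish earliest), and the names `place_experts`, `layer_time`, `expert_load`, `gpu_speed`, and `slots_per_gpu` are all hypothetical.

```python
from typing import Dict, List


def place_experts(expert_load: Dict[int, float],
                  gpu_speed: Dict[int, float],
                  slots_per_gpu: int) -> Dict[int, List[int]]:
    """Greedy sketch (not ViBE's published method): assign the heaviest
    expert first, each time to the GPU with the earliest projected finish."""
    if len(expert_load) > slots_per_gpu * len(gpu_speed):
        raise ValueError("not enough expert slots for all experts")
    placement: Dict[int, List[int]] = {g: [] for g in gpu_speed}
    assigned = {g: 0.0 for g in gpu_speed}  # tokens assigned per GPU
    for expert in sorted(expert_load, key=expert_load.get, reverse=True):
        # Only GPUs with a free expert slot are candidates.
        candidates = [g for g in gpu_speed if len(placement[g]) < slots_per_gpu]
        # Projected time = (tokens already assigned + this expert) / relative speed.
        best = min(candidates,
                   key=lambda g: (assigned[g] + expert_load[expert]) / gpu_speed[g])
        placement[best].append(expert)
        assigned[best] += expert_load[expert]
    return placement


def layer_time(placement: Dict[int, List[int]],
               expert_load: Dict[int, float],
               gpu_speed: Dict[int, float]) -> float:
    """A layer finishes when its slowest GPU (the straggler) finishes."""
    return max(sum(expert_load[e] for e in experts) / gpu_speed[g]
               for g, experts in placement.items())


# Hypothetical profile: tokens routed to each expert, and relative GPU speed.
expert_load = {0: 400.0, 1: 300.0, 2: 200.0, 3: 100.0}
gpu_speed = {0: 1.0, 1: 0.75}  # GPU 1 is a nominally identical but slower device

aware = place_experts(expert_load, gpu_speed, slots_per_gpu=2)
token_balanced = {0: [0, 3], 1: [1, 2]}  # equal tokens per GPU, speed ignored

print(aware, layer_time(aware, expert_load, gpu_speed))
print(token_balanced, layer_time(token_balanced, expert_load, gpu_speed))
```

In this toy setup the token-balanced placement gives both GPUs 500 tokens, yet the slower GPU becomes the straggler. The speed-aware placement shifts load toward the faster GPU and lowers the layer time.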
ViBE reduces layer-level execution-time imbalance without modifying model semantics or hardware. It also supports lightweight recalibration when workload or performance drifts. In experiments, ViBE improved SLO attainment by 14% and reduced P90 time-to-first-token by up to 45%. The authors show that hardware variability becomes more impactful at scale, making frameworks like ViBE critical for efficient LLM serving in production.
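One way to picture a recalibration trigger is a drift check against the last profile. The summary does not say how ViBE detects drift, so this is a minimal sketch under assumptions: the function name `needs_recalibration`, the relative-change criterion, and the 10% default threshold are all hypothetical.

```python
from typing import Dict


def needs_recalibration(baseline: Dict[int, float],
                        current: Dict[int, float],
                        threshold: float = 0.10) -> bool:
    """Return True if any profiled value (per-GPU speed or per-expert load)
    moved by more than `threshold` relative to its baseline.
    Hypothetical criterion, not taken from the ViBE paper."""
    for key, base in baseline.items():
        now = current.get(key, base)  # missing measurements count as unchanged
        if base == 0:
            if now != 0:
                return True  # anything appearing from zero counts as drift
            continue
        if abs(now - base) / base > threshold:
            return True
    return False


# Hypothetical relative GPU speeds at profiling time and later.
baseline_speed = {0: 1.0, 1: 0.75}
drifted_speed = {0: 1.0, 1: 0.6}   # GPU 1 slowed by 20%

print(needs_recalibration(baseline_speed, drifted_speed))
print(needs_recalibration(baseline_speed, dict(baseline_speed)))
```

The same check could be applied to expert activation counts to catch workload drift. In either case a positive result would prompt re-profiling and a fresh placement.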
- ViBE jointly accounts for workload skew and hardware variability in MoE serving, reducing stragglers and improving SLO attainment by 14%.
- Uses per-GPU performance modeling and expert activation profiling to place high-load experts on faster GPUs.
- Achieves up to 45% reduction in P90 TTFT without modifying model semantics or hardware.
Why It Matters
Hardware-aware expert placement unlocks significant latency gains for MoE models, critical for scalable LLM inference.