DODOCO framework debunks key assumptions in MoE dispatch bottlenecks

arXiv cs.DC May 21, 2026

⚡ Mock tokens overestimate routing imbalance by up to 2.35x, and scaling EP does not fix the straggler.

Deep Dive

A new paper from FAU Erlangen-Nürnberg introduces DODOCO (Diagnosing Overhead in Dispatch Operations: Cross-architecture Observatory), a framework that systematically tests the two foundational assumptions behind current AlltoAll dispatch mitigation strategies in Mixture-of-Experts (MoE) models. The interconnect community has proposed four families of mitigations—predictive sample placement, adaptive expert relayout, hierarchical collectives, and EP-aware topology—all resting on the idea that routing imbalance is correctable by the system layer and that mock-token benchmarks faithfully represent production routing. DODOCO puts both to the test.

By instrumenting five MoE checkpoints (DeepSeek-V2-Lite MLA, DeepSeek-MoE-16B MHA, Qwen3-30B GQA, Nemotron-30B Mamba-2, Qwen3.5-35B GDN) across a 5×6 grid of data conditions and scaling expert parallelism from 4 to 32 ranks on H100 GPUs, the team finds both assumptions fail. Scaling EP changes the per-expert max/mean token ratio by at most 5% within each architecture's measurable range—the straggler is intrinsic to the routing decision the model makes, not to how experts land on ranks. Mock tokens overestimate routing Gini by up to a factor of 2.35 and fabricate a batch-size scaling trend that disappears the moment real text replaces random IDs. More surprisingly, the five architectures cleave into two stable bands: MHA and Mamba-2 (data-resilient) drop to Gini 0.105 and 0.150 on wikitext, while MLA and GDN (persistently concentrated) stay above 0.24 on every real-text condition. GQA sits in between. These architectural bands, not EP degree or mock-data profiles, should be the primary inputs to AlltoAll-aware dispatch design.
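The two imbalance metrics quoted above, routing Gini and the per-expert max/mean token ratio, can be computed from a histogram of tokens per expert. The following is a minimal sketch, not the paper's code: it assumes the standard sorted-rank definition of the Gini coefficient (this summary does not say which variant DODOCO uses), the function names are hypothetical, and the token counts are made up for illustration.

```python
from typing import Sequence


def gini(counts: Sequence[float]) -> float:
    """Gini coefficient of non-negative per-expert token counts.

    0.0 means every expert received the same number of tokens; values
    approaching 1.0 mean a few experts received almost all of them.
    Assumption: the standard sorted-rank formula, which may differ from
    the exact variant used in the paper.
    """
    xs = sorted(counts)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # G = 2 * sum(i * x_i) / (n * total) - (n + 1) / n, with ranks i = 1..n
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


def max_mean_ratio(counts: Sequence[float]) -> float:
    """Ratio of the busiest expert's token count to the mean count.

    1.0 means perfectly balanced; larger values mean a heavier straggler.
    """
    if not counts:
        return 0.0
    mean = sum(counts) / len(counts)
    return max(counts) / mean if mean > 0 else 0.0


if __name__ == "__main__":
    # Illustrative counts only; these are not measurements from the paper.
    balanced = [10, 10, 10, 10]
    concentrated = [0, 0, 0, 40]
    print(gini(balanced), max_mean_ratio(balanced))
    print(gini(concentrated), max_mean_ratio(concentrated))
```

Both metrics depend only on how many tokens the router sends to each expert, not on which rank hosts the expert, so they measure the routing decision itself.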

Key Points

Scaling expert parallelism from 4 to 32 ranks on H100s changes per-expert token ratios by at most 5%—the straggler is intrinsic to model routing, not expert placement.
Mock tokens overestimate routing Gini by up to 2.35x and create false batch-size scaling trends that vanish with real text.
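Four of the five MoE architectures fall into two stable bands: data-resilient (MHA and Mamba-2, Gini 0.105 and 0.150 on wikitext) and persistently concentrated (MLA and GDN, Gini above 0.24 on every real-text condition), with GQA in between. These bands provide actionable workload inputs for interconnect design.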

Why It Matters

For engineers scaling MoE models, this means routing imbalance is architectural, not fixable by system tricks alone.
