Research & Papers

DODOCO framework debunks key assumptions about MoE dispatch bottlenecks

Mock tokens overestimate routing imbalance by 2.35x, and scaling EP doesn't fix the straggler.

Deep Dive

A new paper from FAU Erlangen-Nürnberg introduces DODOCO (Diagnosing Overhead in Dispatch Operations: Cross-architecture Observatory), a framework that systematically tests the two foundational assumptions behind current AlltoAll dispatch mitigation strategies in Mixture-of-Experts (MoE) models. The interconnect community has proposed four families of mitigations: predictive sample placement, adaptive expert relayout, hierarchical collectives, and EP-aware topology. All four rest on two assumptions: that routing imbalance is correctable at the system layer, and that mock-token benchmarks faithfully represent production routing. DODOCO puts both to the test.

By instrumenting five MoE checkpoints (DeepSeek-V2-Lite MLA, DeepSeek-MoE-16B MHA, Qwen3-30B GQA, Nemotron-30B Mamba-2, Qwen3.5-35B GDN) across a 5×6 grid of data conditions and scaling expert parallelism (EP) from 4 to 32 ranks on H100 GPUs, the team finds that both assumptions fail. Scaling EP changes the per-expert max/mean token ratio by at most 5% within each architecture's measurable range: the straggler is intrinsic to the routing decision the model makes, not to how experts land on ranks. Mock tokens overestimate routing Gini by up to a factor of 2.35 and fabricate a batch-size scaling trend that disappears the moment real text replaces random IDs. More surprisingly, four of the five architectures fall into two stable bands. MHA and Mamba-2 are data-resilient, dropping to Gini 0.105 and 0.150 on wikitext, while MLA and GDN are persistently concentrated, staying above 0.24 on every real-text condition. GQA sits in between. These architectural bands, not EP degree or mock-data profiles, should be the primary inputs to AlltoAll-aware dispatch design.
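The two imbalance metrics quoted above can be illustrated with a minimal Python sketch. It assumes the standard Gini coefficient computed over per-expert token counts; the summary does not give the paper's exact formula, so treat this as an illustration rather than DODOCO's implementation. The function names (expert_counts, routing_gini, max_mean_ratio) and the input layout (one list of top-k expert IDs per token) are hypothetical.

```python
def expert_counts(topk_expert_ids, num_experts):
    """Count how many token slots the router sent to each expert.

    topk_expert_ids: one list of chosen expert IDs per token (assumed layout).
    """
    counts = [0] * num_experts
    for token_choices in topk_expert_ids:
        for expert_id in token_choices:
            counts[expert_id] += 1
    return counts


def routing_gini(counts):
    """Standard Gini coefficient of the counts: 0 means perfectly even load."""
    n = len(counts)
    total = sum(counts)
    if n == 0 or total == 0:
        return 0.0
    ordered = sorted(counts)
    # Rank-weighted sum over counts sorted in ascending order.
    weighted = sum(rank * c for rank, c in enumerate(ordered, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


def max_mean_ratio(counts):
    """Load of the busiest expert relative to the mean load (the straggler)."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return max(counts) * len(counts) / total


if __name__ == "__main__":
    # Toy routing trace: 4 tokens, top-2 routing, 4 experts (made-up data).
    trace = [[0, 1], [0, 2], [0, 1], [0, 3]]
    counts = expert_counts(trace, num_experts=4)
    print(counts, routing_gini(counts), max_mean_ratio(counts))
```

Both metrics depend only on the per-expert counts the router produces, not on which rank hosts each expert, which is one way to read the finding that the straggler comes from the routing decision rather than from expert placement.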

Key Points
  • Scaling expert parallelism from 4 to 32 ranks on H100s changes the per-expert max/mean token ratio by at most 5%: the straggler is intrinsic to model routing, not expert placement.
  • Mock tokens overestimate routing Gini by up to 2.35x and create false batch-size scaling trends that vanish with real text.
  • Four of the five MoE architectures split into two bands, data-resilient (MHA, Mamba-2; Gini 0.105 and 0.150 on wikitext) and persistently concentrated (MLA, GDN; Gini above 0.24 on every real-text condition), with GQA in between. These bands provide actionable workload inputs for interconnect design.

Why It Matters

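For engineers scaling MoE models, the takeaway is that routing imbalance is set largely by the model architecture and the input data, so raising the EP degree or rearranging experts across ranks will not remove the straggler on its own. Dispatch benchmarks should also use real text, since mock tokens overstate the imbalance.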