Equifinality in Mixture of Experts: Routing Topology Does Not Determine Language Modeling Quality
Study shows simple routing in Mixture of Experts models performs as well as far more complex mechanisms while using fewer routing parameters.
Researchers Ivan Ternovtsii and Yurii Bilak have published a paper challenging conventional wisdom about Mixture of Experts (MoE) architectures, which are reportedly used in large language models such as GPT-4 and Claude. Their study, "Equifinality in Mixture of Experts: Routing Topology Does Not Determine Language Modeling Quality," demonstrates that increasingly sophisticated routing mechanisms (learned routers, multi-hop trajectories, token-dependent gating) do not determine final model performance. They built a geometric MoE, ST-MoE, that routes tokens by cosine similarity against learned centroids in a low-dimensional space (d_space = 64), requiring 80% fewer routing parameters than a standard linear router.
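The routing mechanism can be sketched in a few lines of PyTorch. In the sketch below, only d_space = 64 comes from the paper; the model width, expert count, and the fixed random projection into the routing space are illustrative assumptions, not details from the study.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineCentroidRouter(nn.Module):
    """Sketch of cosine-similarity routing against learned centroids.

    Only d_space = 64 is taken from the paper; d_model, n_experts, and
    the fixed (non-learned) projection are illustrative assumptions.
    """

    def __init__(self, d_model: int = 512, n_experts: int = 8, d_space: int = 64):
        super().__init__()
        # Fixed random projection into the routing space, stored as a
        # buffer so it contributes no learnable parameters.
        self.register_buffer("proj", torch.randn(d_model, d_space) / d_model ** 0.5)
        # The only learnable routing parameters: one centroid per expert.
        self.centroids = nn.Parameter(torch.randn(n_experts, d_space))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq, d_model) -> routing weights: (batch, seq, n_experts)
        z = F.normalize(h @ self.proj, dim=-1)   # unit-norm token coordinates
        c = F.normalize(self.centroids, dim=-1)  # unit-norm expert centroids
        return F.softmax(z @ c.T, dim=-1)        # cosine scores -> routing weights
```

With these toy sizes the router learns n_experts × d_space = 512 parameters, versus d_model × n_experts = 4,096 for a standard linear router; the paper's 80% figure reflects its own model dimensions.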
Through 62 controlled experiments on WikiText-103 with 76-84M-parameter models trained to convergence (50K steps, 1.64B tokens), they found that routing topology does not determine asymptotic perplexity. Five cosine-routing variants were statistically equivalent within a 1-perplexity margin, a result confirmed with Two One-Sided Tests (TOST) across 15 runs and 3 seeds. The finding extended to hash, random-fixed, and top-1 routing, and replicated on OpenWebText with only a 0.03 perplexity gap. A standard linear router with 5.3× more routing parameters reached perplexity 32.76, but an iso-parameter cosine router closed 67% of the gap to it.
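Equivalence within a margin is established with Two One-Sided Tests rather than an ordinary t-test, which can only fail to detect a difference. Here is a hedged sketch of such a check using SciPy, with made-up perplexity numbers rather than the paper's run data:

```python
import numpy as np
from scipy import stats

def tost_equivalent(a, b, margin=1.0, alpha=0.05):
    """Two One-Sided Tests (TOST): are two sample means equivalent
    within +/- margin? Declares equivalence when both one-sided
    t-tests reject, i.e. when max(p_lower, p_upper) < alpha."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    # H0: mean(a) - mean(b) <= -margin  vs  H1: difference > -margin
    p_lower = stats.ttest_ind(a + margin, b, alternative="greater").pvalue
    # H0: mean(a) - mean(b) >= +margin  vs  H1: difference < +margin
    p_upper = stats.ttest_ind(a - margin, b, alternative="less").pvalue
    return max(p_lower, p_upper) < alpha

# Illustrative perplexities for two routing variants (3 seeds each);
# these numbers are invented, not taken from the paper.
ppl_cosine = [33.1, 33.4, 33.2]
ppl_hash = [33.5, 33.3, 33.6]
print(tost_equivalent(ppl_cosine, ppl_hash, margin=1.0))  # True
```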
The mechanistic explanation reveals convergent redundancy: successive multi-hop updates are nearly collinear (cos(Δh_0, Δh_1) = 0.805), so extra hops implement magnitude amplification rather than compositional reasoning. Remarkably, a single learnable scalar could replicate multi-hop performance. As a practical payoff, the authors' zero-shot relative-norm halting technique saves 25% of MoE FLOPs at only a +0.12% perplexity cost. The research suggests that much of the complexity in current MoE routing systems may be unnecessary, pointing toward more efficient architectures that maintain performance while reducing computational overhead.
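Because the second hop's update is nearly collinear with the first (Δh_1 ≈ α·Δh_0), two hops reduce to h + (1 + α)Δh_0, which is why a single scalar can stand in for an extra hop. The halting idea follows the same logic: once an update barely moves the hidden state, further hops can be skipped. Below is a minimal sketch of relative-norm halting, computed per sequence for simplicity; the moe_layer interface, the hop count, and the threshold tau are illustrative assumptions, not the paper's implementation.

```python
import torch

def multi_hop_with_halting(h, moe_layer, max_hops=2, tau=0.1):
    """Sketch of zero-shot relative-norm halting: stop applying further
    MoE hops once an update is small relative to the hidden state.
    moe_layer, max_hops, and tau are illustrative assumptions."""
    for hop in range(max_hops):
        delta = moe_layer(h)                 # expert update for this hop
        h = h + delta                        # residual update
        rel_norm = delta.norm() / h.norm()   # relative update magnitude
        if rel_norm < tau:                   # update barely moved h, so
            break                            # skip remaining hops' FLOPs
    return h
```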
- ST-MoE uses cosine-similarity routing with 80% fewer routing parameters than standard linear routers
- 62 experiments showed routing variants statistically equivalent within 1 perplexity point
- Zero-shot relative-norm halting saves 25% of MoE FLOPs with only +0.12% perplexity cost
Why It Matters
Simpler routing could lead to more efficient large language models, significantly reducing computational costs while maintaining quality.