Affinity Is Not Enough: Recovering the Free Energy Principle in Mixture-of-Experts
Standard MoE routing fails at domain transitions; a new gating method places 0.748 probability on the correct expert, a 124x improvement.
Sparse Mixture-of-Experts (MoE) models are powerful but struggle at domain boundaries, where one token belongs to one distribution and the next to another. Standard affinity routing assigns only 0.006 probability to the correct expert at such transitions, a catastrophic failure. Man Yung Wong (Russell) proposes three lightweight gate modifications that fix this by recovering the Free Energy Principle from cognitive neuroscience and borrowing leaky integrate-and-fire (LIF) dynamics from spiking neural networks. The modifications are: temporal memory (beta), which accumulates routing context in a per-expert LIF membrane potential; precision-weighted gating (Pi), which weights experts by the inverse variance of their recent prediction error, yielding a 31x contrast between reliable and unreliable experts; and anticipatory routing (Ant), a next-state predictor conditioned on the beta-accumulated hidden state.
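The sketch below shows how the three modifications could slot into a single gate. It is a minimal PyTorch illustration, not the paper's reference code: the class name `FEPGate`, the `decay` and `eps` hyperparameters, and the exact way the terms combine are assumptions inferred from the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FEPGate(nn.Module):
    """Minimal sketch of an MoE gate with beta, Pi, and Ant modifications.

    Hyperparameters (decay, eps) and the exact update rules are
    illustrative assumptions, not the paper's reference implementation.
    """

    def __init__(self, d_model: int, n_experts: int,
                 decay: float = 0.9, eps: float = 1e-6):
        super().__init__()
        self.affinity = nn.Linear(d_model, n_experts)    # standard affinity router
        self.anticipate = nn.Linear(d_model, n_experts)  # next-state predictor (Ant)
        self.decay = decay                               # LIF leak factor
        self.eps = eps
        # beta: per-expert LIF membrane potential accumulating routing evidence.
        self.register_buffer("membrane", torch.zeros(n_experts))
        # Leaky accumulator of the hidden state that conditions the Ant predictor.
        self.register_buffer("h_acc", torch.zeros(d_model))
        # Pi: running variance of recent per-expert prediction error.
        self.register_buffer("err_var", torch.ones(n_experts))

    def forward(self, h: torch.Tensor, prev_error: torch.Tensor) -> torch.Tensor:
        """h: (d_model,) current hidden state; prev_error: (n_experts,)
        recent prediction error per expert. Returns routing probabilities."""
        logits = self.affinity(h)

        # beta: leaky integrate-and-fire accumulation of routing context.
        self.membrane = self.decay * self.membrane + logits.detach()
        self.h_acc = self.decay * self.h_acc + h.detach()

        # Pi: precision = inverse variance of recent prediction error,
        # upweighting experts that have been reliably predictable.
        self.err_var = (self.decay * self.err_var
                        + (1 - self.decay) * prev_error.detach() ** 2)
        precision = 1.0 / (self.err_var + self.eps)

        # Ant: anticipatory logits from the beta-accumulated hidden state.
        ant_logits = self.anticipate(self.h_acc)

        # Precision-weighted combination of affinity, memory, and anticipation.
        return F.softmax(precision * (logits + self.membrane + ant_logits), dim=-1)
```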
In controlled experiments (4 experts, 5 seeds), beta+Ant gates raise transition probability to 0.748 (a 124x improvement) and shrink the number of experts needed for 99% coverage from effectively unbounded to a small constant. Ablations reveal a super-additive interaction: anticipation alone gives nothing (+0.000) and beta alone gives a modest gain (+0.295), but together they close 75% of the oracle gap (+0.741). In a character-level MoE language model (5 seeds), beta-routing cuts transition-step bits-per-character from 6.56 to 4.01, and beta+Ant places 0.86 probability on the correct domain expert before that domain appears, nearly double standard MoE's 0.42. The paper includes ~200-line reference implementations.
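For intuition on how the transition numbers above could be computed, here is a small scoring sketch. The function name, its signature, and the use of an oracle expert label per token are assumptions for illustration; the paper may define these metrics differently.

```python
import math
import torch

def transition_metrics(probs: torch.Tensor, oracle: torch.Tensor,
                       boundary: torch.Tensor, char_logprob: torch.Tensor):
    """Score routing at domain transitions (names and signature are assumptions).

    probs:        (T, n_experts) router probabilities per token
    oracle:       (T,) index of the correct expert per token
    boundary:     (T,) bool, True where the domain changes
    char_logprob: (T,) natural-log model probability of the observed character
    """
    idx = boundary.nonzero(as_tuple=True)[0]
    # Mean probability the router places on the correct expert at transitions.
    p_correct = probs[idx, oracle[idx]].mean().item()
    # Bits-per-character at transition steps: -log2 p(char).
    bpc = (-char_logprob[idx] / math.log(2)).mean().item()
    return p_correct, bpc
```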
- Standard MoE routing assigns only 0.006 probability to the correct expert at domain transitions; the new method achieves 0.748, a 124x improvement.
- Three modifications (beta, Pi, Ant) are inspired by the Free Energy Principle and LIF spiking dynamics, with beta+Ant showing a super-additive synergy.
- A character-level MoE LM with beta-routing reduces transition BPC from 6.56 to 4.01, and beta+Ant places 0.86 probability on the upcoming domain's expert versus 0.42 for standard MoE.
Why It Matters
Reliable routing at domain transitions lets sparse MoE models serve multi-domain streams with far fewer experts, cutting expert costs and improving language-model reliability at distribution shifts.