Research & Papers

Speculating Experts Accelerates Inference for Mixture-of-Experts

New prefetching technique predicts which experts an MoE model will need next, hiding the CPU-GPU transfer bottleneck to boost inference speed.

Deep Dive

A team of researchers including Vivan Madan, Prajwal Singhania, and Ashwinee Panda has introduced an optimization called 'Speculating Experts' for Mixture-of-Experts (MoE) models. MoE architectures, like those used in many large language models, activate only a small subset of 'expert' networks per token to keep computational cost manageable. In memory-limited inference scenarios, however, the expert weights are often held in CPU memory, and transferring them to the GPU at every decoding step becomes a major bottleneck. The new technique tackles this by using the model's current internal activations to reliably speculate which experts will be needed next, so their weights can be prefetched from CPU memory in the background while computation is still running.
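To make the idea concrete, here is a minimal sketch of the speculate-and-prefetch step. It illustrates the general pattern rather than the authors' released code: it assumes the next MoE layer's router gate can be applied to the current hidden state to guess the upcoming top-k experts, and that expert weights sit in pinned CPU memory so the copy can run asynchronously. Names such as speculate_next_experts, prefetch_experts, and cpu_expert_weights are hypothetical.

```python
# Minimal sketch of speculative expert prefetching (illustrative, not the
# authors' implementation). Assumes expert weights are pre-pinned CPU tensors.
import torch

def speculate_next_experts(hidden_state, next_layer_router, top_k=2):
    """Guess which experts the next MoE layer will route each token to."""
    # next_layer_router: an nn.Linear gate mapping hidden dim -> num_experts
    logits = next_layer_router(hidden_state)        # [num_tokens, num_experts]
    top_ids = logits.topk(top_k, dim=-1).indices    # speculated experts per token
    return torch.unique(top_ids)                    # distinct expert ids to prefetch

def prefetch_experts(expert_ids, cpu_expert_weights, gpu_cache, copy_stream):
    """Kick off asynchronous CPU->GPU copies of speculated expert weights."""
    with torch.cuda.stream(copy_stream):            # copies issue on a side stream
        for eid in expert_ids.tolist():
            if eid not in gpu_cache:
                w = cpu_expert_weights[eid]          # pinned host tensor (assumed)
                gpu_cache[eid] = w.to("cuda", non_blocking=True)
```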

This speculative prefetching and execution creates significant overlap between compute and memory transfer, which is the key to the performance gain. The paper demonstrates that across multiple MoE architectures, future experts can be predicted with high reliability, and executing these speculated experts generally preserves downstream task accuracy. For cases where speculation alone hurts accuracy, the researchers also developed lightweight estimators to improve prediction hit rates. Integrated into an optimized inference engine, the method delivers up to a 14% reduction in time per output token (TPOT) compared to the standard on-demand loading approach. By releasing the code as open-source, the team provides a practical tool for developers and companies to run large, sparse MoE models more efficiently on hardware with limited GPU memory.
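Building on the previous sketch, the fragment below shows one way the overlap could be arranged with a dedicated CUDA copy stream: while layer i runs on the default stream, speculated experts for layer i+1 are copied over in parallel, and the compute stream waits on the copy stream before the next layer consumes them. The per-layer router attribute and the cache-aware layer(...) call are assumptions for illustration, not the released engine's API.

```python
# Sketch of overlapping compute with prefetch, reusing the helpers above.
# The layer/router/cache interfaces are assumptions for illustration.
import torch

copy_stream = torch.cuda.Stream()                   # dedicated stream for weight copies

def decode_step(layers, hidden, gpu_caches, cpu_expert_weights):
    for i, layer in enumerate(layers):
        # 1. Speculate which experts layer i+1 will need and start prefetching them.
        if i + 1 < len(layers):
            ids = speculate_next_experts(hidden, layers[i + 1].router)
            prefetch_experts(ids, cpu_expert_weights[i + 1], gpu_caches[i + 1], copy_stream)
        # 2. Run layer i on the default stream; the copy engine moves the next
        #    layer's weights in parallel with this compute.
        hidden = layer(hidden, gpu_caches[i])        # misses fall back to on-demand loads
        # 3. Ensure prefetched weights have landed before the next layer uses them.
        torch.cuda.current_stream().wait_stream(copy_stream)
    return hidden
```

The wait at the end of each iteration costs nothing when the copy finished during compute, and simply exposes whatever transfer time remains when it did not.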

Key Points
  • Uses internal model representations to predict and prefetch future MoE experts, overlapping CPU-GPU data transfer with computation.
  • Achieves up to a 14% reduction in time per output token (TPOT) over standard on-demand expert loading.
  • Maintains model accuracy through reliable speculation and adds lightweight estimators to improve hit rates where needed (see the sketch after this list).
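For the third point, a lightweight estimator could be as simple as a small probe trained offline to predict the next layer's expert selection from the current hidden state. The sketch below is an assumption about what such an estimator might look like, not the paper's exact construction.

```python
# Hypothetical lightweight estimator: a tiny probe trained offline to predict the
# next layer's expert choices from the current hidden state. The paper's estimators
# may be built differently; this only illustrates the "cheap predictor" idea.
import torch
import torch.nn as nn

class ExpertEstimator(nn.Module):
    def __init__(self, hidden_dim, num_experts):
        super().__init__()
        # A single linear layer keeps prediction overhead negligible next to
        # the cost of running the experts themselves.
        self.proj = nn.Linear(hidden_dim, num_experts)

    def forward(self, hidden_state, top_k=2):
        scores = self.proj(hidden_state)             # [num_tokens, num_experts]
        return scores.topk(top_k, dim=-1).indices    # predicted expert ids

# Training target: the expert ids the real router actually selected, logged from
# sample traffic, so the hit rate can be measured and tuned offline.
```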

Why It Matters

Enables faster, more cost-effective deployment of massive MoE models like Mixtral 8x7B on consumer or memory-constrained hardware.