RaMP: Runtime-Aware Megakernel Polymorphism for Mixture-of-Experts
New framework recovers 10-70% of lost throughput by optimizing kernel choice at runtime.
Researchers Vyom Sharma and Debajyoti Datta have introduced RaMP (Runtime-Aware Megakernel Polymorphism), a novel framework that dynamically selects optimal kernel configurations for Mixture-of-Experts (MoE) inference. Traditional production systems dispatch kernels based solely on batch size, leaving 10-70% of potential throughput untapped. RaMP addresses this by analyzing the runtime expert routing distribution—specifically the histogram of how many experts are activated per token—and using a lightweight four-parameter wave cost model to pick the best kernel from a pool of 134-268 polymorphic configurations.
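The selection step described above can be sketched as follows. This is a minimal illustration, not RaMP's actual API: the `KernelConfig` fields, the cost-model parameter names `(a, b, c, d)`, and the exact cost formula are assumptions; the only grounded ideas are that the input is a histogram of experts activated per token, that cost depends only on CTA grid geometry, and that the best configuration is the one with minimal predicted cost.

```python
# Hypothetical sketch of runtime-aware kernel selection in the RaMP style.
# All names and the cost formula are illustrative assumptions.
from dataclasses import dataclass
from math import ceil


@dataclass(frozen=True)
class KernelConfig:
    tile_m: int  # CTA tile size along the routed-token dimension
    tile_n: int  # CTA tile size along the output-column dimension


def expert_histogram(experts_per_token):
    """Histogram of how many experts each token activates."""
    hist = {}
    for k in experts_per_token:
        hist[k] = hist.get(k, 0) + 1
    return hist


def wave_cost(cfg, hist, n_cols, num_sms, params):
    """Assumed four-parameter wave cost model: cost is a function of
    CTA grid geometry only (CTA count, waves, and the tail wave)."""
    a, b, c, d = params
    total_ctas = 0
    for k, count in hist.items():
        rows = count * k  # expert-token pairs routed at this activation count
        total_ctas += ceil(rows / cfg.tile_m) * ceil(n_cols / cfg.tile_n)
    waves = ceil(total_ctas / num_sms)           # full + partial waves on the GPU
    tail = total_ctas - (waves - 1) * num_sms    # CTAs in the last, partial wave
    return a * waves + b * tail / num_sms + c * total_ctas + d


def select_kernel(pool, hist, n_cols, num_sms, params):
    """Pick the polymorphic configuration with minimal predicted cost."""
    return min(pool, key=lambda cfg: wave_cost(cfg, hist, n_cols, num_sms, params))
```

Because the model reads only the routing histogram and the resulting grid geometry, evaluating it over a pool of a few hundred configurations is cheap enough to run per batch at serving time.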
The framework's performance-region analysis derives optimal configurations from hardware constants alone, correctly predicting behavior across all 8 tested architectures, including 3 unseen ones. With just 10-24 minutes of one-time profiling per model, RaMP achieves a mean regret of 0.93% versus exhaustive search. Applied to Alpha-MoE, it delivers a 1.14x speedup with no source modifications. In real-world serving with vLLM, RaMP achieves a 1.30x end-to-end speedup over Triton, 1.41x over DeepGEMM, and 1.13x over FlashInfer CUTLASS. The cost model is kernel-agnostic, depending only on CTA (Cooperative Thread Array) grid geometry, which makes it broadly applicable across MoE implementations.
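To make the "hardware constants alone" claim concrete, here is a hedged sketch of one ingredient of a performance-region analysis: finding the workload sizes at which a configuration's CTA grid crosses a multiple of the SM count and therefore gains an extra wave. The function name and signature are invented for illustration; the grounded idea is that wave boundaries follow directly from CTA grid geometry and the number of SMs, with no profiling required.

```python
# Illustrative sketch: derive performance-region boundaries from hardware
# constants (SM count) and CTA grid geometry alone. Names are assumptions.
from math import ceil


def wave_boundaries(tile_m, n_col_ctas, num_sms, max_rows):
    """Row counts at which ceil(ctas / num_sms) increments.
    These row counts delimit the performance regions of a config."""
    boundaries = []
    prev_waves = 0
    for rows in range(1, max_rows + 1):
        ctas = ceil(rows / tile_m) * n_col_ctas  # CTA grid size at this workload
        waves = ceil(ctas / num_sms)             # waves needed to drain the grid
        if waves != prev_waves:
            boundaries.append(rows)
            prev_waves = waves
    return boundaries
```

Within one region the wave count is flat, so throughput is roughly constant; a selector only needs to know which region the current routed workload falls into, which is why a one-time profiling pass per model suffices.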
- RaMP recovers 10-70% of lost kernel throughput by selecting kernels based on the expert routing distribution, not just batch size.
- Achieves 0.93% mean regret vs. exhaustive search with only 10-24 minutes of one-time profiling per model.
- Delivers 1.30x end-to-end speedup in vLLM serving over Triton, 1.41x over DeepGEMM, and 1.13x over FlashInfer CUTLASS.
Why It Matters
RaMP enables faster, more efficient MoE inference without hardware changes, which is crucial for scaling large language models in production.