TritonMoE runs MoE models on AMD and NVIDIA without CUDA
A single Triton kernel handles Mixtral, DeepSeek, and Qwen on both the A100 and the MI300X.
Mixture-of-Experts (MoE) architectures are the backbone of frontier large language models, but their inference is bottlenecked by irregular memory access and expert routing overhead. Existing optimized kernels like Megablocks, Tutel, and FasterMoE are written in CUDA and locked to NVIDIA hardware. Subhadip Mitra’s new paper presents TritonMoE, a fused MoE dispatch kernel written entirely in OpenAI Triton. It performs the full forward pass — router scoring, token permutation, expert GEMMs, and weighted output combination — using only portable Triton primitives. The key innovation is a fused gate+up GEMM kernel that computes both SwiGLU projections from shared L2-cached input tiles with in-register SiLU activation, cutting global memory traffic by 35%.
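To make the four stages of that forward pass concrete, here is a minimal reference sketch in plain Python. It is not the paper's Triton kernel and says nothing about tiling or caching. The function names (`moe_forward`, `swiglu_expert`), the top-k selection, and the renormalization of the selected router weights are all assumptions made for illustration. It only shows the arithmetic that a fused kernel would have to reproduce.

```python
import math


def matvec(w, x):
    """Multiply matrix w (list of rows) by vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def silu(v):
    """SiLU activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def swiglu_expert(x, w_gate, w_up, w_down):
    """One expert FFN: down(silu(gate(x)) * up(x)).

    The gate and up projections read the same input x. A fused kernel can
    load each input tile once for both and apply SiLU before writing back.
    """
    gate = matvec(w_gate, x)
    up = matvec(w_up, x)
    hidden = [silu(g) * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)


def moe_forward(tokens, w_router, experts, top_k=2):
    """Reference MoE layer. `experts` is a list of (w_gate, w_up, w_down)."""
    d_model = len(tokens[0])

    # 1. Router scoring: softmax over experts, keep the top_k per token.
    #    Renormalizing the kept weights is an assumption of this sketch.
    routes = []
    for x in tokens:
        probs = softmax(matvec(w_router, x))
        top = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)[:top_k]
        norm = sum(probs[e] for e in top)
        routes.append([(e, probs[e] / norm) for e in top])

    # 2. Token permutation: group token indices by their assigned expert.
    buckets = {e: [] for e in range(len(experts))}
    for t, route in enumerate(routes):
        for e, weight in route:
            buckets[e].append((t, weight))

    # 3. Expert GEMMs and 4. weighted output combination.
    out = [[0.0] * d_model for _ in tokens]
    for e, assigned in buckets.items():
        for t, weight in assigned:
            y = swiglu_expert(tokens[t], *experts[e])
            for i in range(d_model):
                out[t][i] += weight * y[i]
    return out, routes
```

Calling `moe_forward(tokens, w_router, experts)` with small random matrices returns one output vector per token plus the chosen `(expert, weight)` pairs, which is a convenient baseline for checking an optimized kernel against.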
On an NVIDIA A100, TritonMoE achieves between 89% and 131% of the throughput of Megablocks at inference batch sizes up to 512 tokens, tested across Mixtral-8x7B, DeepSeek-V3, and Qwen2-MoE. Crucially, all 162 correctness tests pass on both the A100 and an AMD MI300X with zero code changes, demonstrating cross-platform portability. The paper also characterizes sensitivity to routing imbalance under Zipfian-skewed expert assignments, noting that with 64+ experts under extreme skew, TritonMoE’s fixed-tile scheduling underperforms Megablocks’ block-sparse layout — motivating dynamic block-to-expert assignment as future work. Code is publicly available.
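For a feel of what Zipfian-skewed routing means in practice, the sketch below samples expert assignments from a Zipf-style distribution and reports the ratio of the busiest expert's load to the mean load. The skew exponent `s`, the helper names, and the max-over-mean metric are illustrative assumptions. The paper's exact skew parameters and imbalance measure are not given in this summary.

```python
import random


def zipf_probs(num_experts, s):
    """Probability of each expert when rank r (1-based) has weight 1 / r**s.

    s = 0 gives uniform routing; larger s concentrates tokens on few experts.
    """
    weights = [1.0 / (r ** s) for r in range(1, num_experts + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def sample_loads(num_tokens, num_experts, s, seed=0):
    """Draw one expert per token and return the token count per expert."""
    rng = random.Random(seed)
    probs = zipf_probs(num_experts, s)
    counts = [0] * num_experts
    for e in rng.choices(range(num_experts), weights=probs, k=num_tokens):
        counts[e] += 1
    return counts


def imbalance(counts):
    """Busiest expert's load divided by the mean load (1.0 = perfectly even)."""
    mean = sum(counts) / len(counts)
    return max(counts) / mean
```

Comparing `imbalance(sample_loads(512, 64, 0.0))` with `imbalance(sample_loads(512, 64, 1.5))` shows how quickly a handful of experts come to dominate a 512-token batch once routing is skewed, which is the regime where the article reports fixed-tile scheduling falling behind.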
- TritonMoE is a fused MoE dispatch kernel in OpenAI Triton, not CUDA, enabling portability across NVIDIA and AMD GPUs.
- On an A100, it delivers 89-131% of Megablocks throughput (batch ≤512 tokens) and cuts global memory traffic by 35% via fused gate+up GEMM.
- All 162 tests pass on both A100 and AMD MI300X with zero code changes; the paper identifies a weakness with 64+ experts under extreme routing skew.
Why It Matters
A CUDA-free MoE kernel that runs on AMD GPUs lowers hardware lock-in for large-scale model inference.