
TritonMoE runs MoE models on AMD and NVIDIA without CUDA

arXiv cs.DC May 26, 2026

⚡ A single Triton kernel handles Mixtral, DeepSeek, and Qwen on both A100 and MI300X.

Deep Dive

Mixture-of-Experts (MoE) architectures are the backbone of frontier large language models, but their inference is bottlenecked by irregular memory access and expert routing overhead. Existing optimized kernels like Megablocks, Tutel, and FasterMoE are written in CUDA and locked to NVIDIA hardware. Subhadip Mitra’s new paper presents TritonMoE, a fused MoE dispatch kernel written entirely in OpenAI Triton. It performs the full forward pass — router scoring, token permutation, expert GEMMs, and weighted output combination — using only portable Triton primitives. The key innovation is a fused gate+up GEMM kernel that computes both SwiGLU projections from shared L2-cached input tiles with in-register SiLU activation, cutting global memory traffic by 35%.
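To make the four stages concrete, here is a minimal NumPy reference of an MoE forward pass. It is a sketch under stated assumptions, not the paper's Triton kernel: the function name moe_forward, the tensor shapes, top-2 routing, and the renormalization of the selected router weights are all illustrative choices. The gate+up fusion is modeled as a single GEMM against concatenated weights, so the input is read once; the L2-cached tiling and in-register SiLU described above happen inside the kernel and are not reproduced here.

```python
import numpy as np

def silu(x):
    # SiLU (swish) activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def moe_forward(x, w_router, w_gate, w_up, w_down, top_k=2):
    """Reference MoE forward pass (hypothetical helper, not the paper's API).

    x: (T, D) tokens; w_router: (D, E);
    w_gate, w_up: (E, D, H); w_down: (E, H, D).
    """
    T = x.shape[0]
    E = w_router.shape[1]
    H = w_gate.shape[2]

    # 1. Router scoring: softmax over experts, keep top_k per token.
    logits = x @ w_router
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    top_idx = np.argsort(-probs, axis=1)[:, :top_k]
    top_w = np.take_along_axis(probs, top_idx, axis=1)
    top_w /= top_w.sum(axis=1, keepdims=True)  # assumption: renormalize

    # 2. Token permutation: group (token, expert) pairs by expert id.
    flat_expert = top_idx.reshape(-1)
    flat_token = np.repeat(np.arange(T), top_k)
    flat_weight = top_w.reshape(-1)
    order = np.argsort(flat_expert, kind="stable")
    counts = np.bincount(flat_expert, minlength=E)
    offsets = np.concatenate(([0], np.cumsum(counts)))
    x_perm = x[flat_token[order]]

    out = np.zeros_like(x)
    for e in range(E):
        lo, hi = offsets[e], offsets[e + 1]
        if lo == hi:
            continue  # no tokens routed to this expert
        xe = x_perm[lo:hi]

        # 3. Expert GEMMs. Gate and up projections share one GEMM
        #    against concatenated weights, so xe is read once.
        w_gu = np.concatenate([w_gate[e], w_up[e]], axis=1)  # (D, 2H)
        gu = xe @ w_gu
        h = silu(gu[:, :H]) * gu[:, H:]  # SwiGLU
        ye = h @ w_down[e]

        # 4. Weighted output combination: scatter-add to token rows.
        rows = order[lo:hi]
        np.add.at(out, flat_token[rows], flat_weight[rows][:, None] * ye)
    return out
```

A reference like this is mainly useful as ground truth for correctness tests: a fused kernel's output can be compared against it on each platform.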

On an NVIDIA A100, TritonMoE achieves between 89% and 131% of the throughput of Megablocks at inference batch sizes up to 512 tokens, tested across Mixtral-8x7B, DeepSeek-V3, and Qwen2-MoE. Crucially, all 162 correctness tests pass on both the A100 and an AMD MI300X with zero code changes, demonstrating cross-platform portability. The paper also characterizes sensitivity to routing imbalance under Zipfian-skewed expert assignments, noting that with 64+ experts under extreme skew, TritonMoE’s fixed-tile scheduling underperforms Megablocks’ block-sparse layout — motivating dynamic block-to-expert assignment as future work. Code is publicly available.
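For readers unfamiliar with Zipfian routing skew, the sketch below shows how such expert assignments can be generated and how unevenly they spread work across fixed-height tiles. Everything here is an assumption for illustration: the function names, the skew exponent s, and the tile height block_m are hypothetical and are not taken from the paper, and the code measures load distribution only, not kernel throughput.

```python
import numpy as np

def zipf_expert_counts(num_tokens, num_experts, s, seed=0):
    """Sample one expert per token with P(rank r) proportional to r**(-s).

    s = 0 gives uniform routing; larger s concentrates tokens on a
    few "hot" experts. Returns the token count per expert.
    """
    rng = np.random.default_rng(seed)
    ranks = np.arange(1, num_experts + 1, dtype=np.float64)
    p = ranks ** (-float(s))
    p /= p.sum()
    assign = rng.choice(num_experts, size=num_tokens, p=p)
    return np.bincount(assign, minlength=num_experts)

def load_imbalance(counts):
    """Ratio of the busiest expert's load to the mean load (1.0 = balanced)."""
    counts = np.asarray(counts, dtype=np.float64)
    return counts.max() / counts.mean()

def tiles_per_expert(counts, block_m=64):
    """Fixed-height tiles needed per expert (ceil division).

    block_m is an assumed tile height, not a value from the paper.
    """
    counts = np.asarray(counts)
    return -(-counts // block_m)

if __name__ == "__main__":
    for s in (0.0, 1.0, 1.5):
        counts = zipf_expert_counts(num_tokens=512, num_experts=64, s=s)
        tiles = tiles_per_expert(counts)
        print(f"s={s}: imbalance={load_imbalance(counts):.1f}, "
              f"busiest expert tiles={tiles.max()}, "
              f"experts with no tokens={int((counts == 0).sum())}")
```

As s grows, a handful of experts receive most tokens while many receive none, which is the regime where the paper reports that fixed-tile scheduling falls behind a block-sparse layout.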

Key Points

TritonMoE is a fused MoE dispatch kernel in OpenAI Triton, not CUDA, enabling portability across NVIDIA and AMD GPUs.
On an A100, it delivers 89-131% of Megablocks throughput (batch ≤512 tokens) and cuts global memory traffic by 35% via fused gate+up GEMM.
All 162 tests pass on both A100 and AMD MI300X with zero code changes; the paper identifies a weakness with 64+ experts under extreme routing skew.

Why It Matters

A CUDA-free MoE kernel that runs on AMD GPUs lowers hardware lock-in for large-scale model inference.
