TritonMoE kernel cuts memory traffic 35%, runs on NVIDIA and AMD

r/MachineLearning May 28, 2026

⚡ New fused MoE kernel eliminates 35% of global memory traffic and runs on both NVIDIA and AMD GPUs.

Deep Dive

Mixture-of-Experts (MoE) models activate only a few experts per token, which keeps compute costs low, but their inference kernels have traditionally been written in CUDA, locking users into NVIDIA hardware. A new preprint introduces TritonMoE, an inference kernel written entirely in OpenAI Triton, a domain-specific language that compiles for both NVIDIA and AMD GPUs without vendor-specific code. Its key innovation is a fused gate+up GEMM that computes both SwiGLU projections from shared tile loads, cutting global memory traffic by 35%. Because each input tile is loaded once and used for both projections, the design eases the memory bottleneck in the MoE layer, a critical factor for real-time inference workloads.
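To make the fused gate+up idea concrete, here is a minimal pure-Python sketch. It is not the TritonMoE kernel and does not use Triton. The function names, the `tile_k` parameter, and the read counter are all assumptions made for illustration. It only shows that one load of an input tile can feed both the gate and the up accumulators, and that the SwiGLU product can be formed without writing the two intermediate matrices.

```python
import math


def silu(v):
    # SiLU activation used by SwiGLU: v * sigmoid(v)
    return v / (1.0 + math.exp(-v))


def swiglu_unfused(x, w_gate, w_up):
    """Two separate GEMMs followed by SiLU(gate) * up.

    Returns the result and the number of input-element reads.
    """
    m, k, n = len(x), len(x[0]), len(w_gate[0])
    x_reads = 0
    gate = [[0.0] * n for _ in range(m)]
    up = [[0.0] * n for _ in range(m)]
    for out, w in ((gate, w_gate), (up, w_up)):
        for i in range(m):
            for j in range(n):
                acc = 0.0
                for p in range(k):
                    acc += x[i][p] * w[p][j]
                    x_reads += 1  # x is read again for each GEMM
                out[i][j] = acc
    result = [[silu(gate[i][j]) * up[i][j] for j in range(n)] for i in range(m)]
    return result, x_reads


def swiglu_fused(x, w_gate, w_up, tile_k=2):
    """One pass: each input tile is loaded once and feeds both accumulators.

    tile_k is a hypothetical tile width along the reduction dimension.
    """
    m, k, n = len(x), len(x[0]), len(w_gate[0])
    x_reads = 0
    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc_gate = 0.0
            acc_up = 0.0
            for k0 in range(0, k, tile_k):
                x_tile = x[i][k0:k0 + tile_k]  # single shared load
                x_reads += len(x_tile)
                for t, xv in enumerate(x_tile):
                    acc_gate += xv * w_gate[k0 + t][j]
                    acc_up += xv * w_up[k0 + t][j]
            # Activation applied in place; no intermediate gate/up matrices
            out[i][j] = silu(acc_gate) * acc_up
    return out, x_reads


if __name__ == "__main__":
    x = [[1.0, 2.0, 3.0], [0.5, -1.0, 2.0]]
    w_gate = [[0.1, 0.2], [0.3, -0.4], [0.5, 0.6]]
    w_up = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.5]]
    _, reads_unfused = swiglu_unfused(x, w_gate, w_up)
    _, reads_fused = swiglu_fused(x, w_gate, w_up)
    print(reads_unfused, reads_fused)
```

In this toy model the fused version reads the input half as often as the unfused one. That ratio covers input reads only. The 35% figure reported for TritonMoE refers to total global memory traffic, which also includes weight loads and output writes that this sketch does not model.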

In benchmarks on an A100 GPU, TritonMoE achieves 89–131% of the throughput of the established Megablocks library at batch sizes up to 512 tokens, and the same kernel runs unchanged on AMD MI300X with no code modifications. However, performance degrades at larger batch sizes (2048+ tokens) and when handling 64+ experts under extreme routing skew. The open-source code and detailed writeup are available on GitHub and the author's blog, offering a portable alternative for production MoE inference across GPU architectures.

Key Points

Fused gate+up GEMM eliminates 35% of global memory traffic for MoE inference.
Achieves 89–131% of Megablocks throughput on A100 at batch sizes up to 512 tokens.
Runs unchanged on AMD MI300X without any vendor-specific code modifications.

Why It Matters

Enables efficient MoE inference on both NVIDIA and AMD GPUs without costly vendor-specific CUDA rewrites.
