Research & Papers

New Python compiler generates fused GPU kernels 20% faster than torch.compile

A 5,000-line pure-Python compiler stack outperforms both PyTorch eager and torch.compile on an RTX 5090

Deep Dive

A developer built a hackable LLM compiler from scratch that lowers TinyLlama and Qwen2.5-7B through six IRs to CUDA kernels. On an RTX 5090, the emitted FP32 kernels achieve a geometric-mean speedup of 1.11× over PyTorch eager and 1.20× over torch.compile, with full-block parity on TinyLlama-128 and Qwen2.5-7B at seq=128. The largest wins, up to 4.7×, come on small reductions, SDPA, and KV-projections. The process is documented in two parts.
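Because the headline numbers are geometric means, a single large win moves them far less than it would move an arithmetic average. The sketch below shows the difference; the ratios in it are made-up illustration values, not measurements from the project, and the helper name `geomean` is hypothetical.

```python
import math


def geomean(ratios):
    """Geometric mean of positive speedup ratios (baseline time / new time)."""
    if not ratios or any(r <= 0 for r in ratios):
        raise ValueError("ratios must be a non-empty list of positive numbers")
    # Average in log space, then map back: exp(mean(log r)).
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Hypothetical per-kernel speedups: one large win, three ties.
illustrative = [4.0, 1.0, 1.0, 1.0]

arithmetic = sum(illustrative) / len(illustrative)
geometric = geomean(illustrative)

print(f"arithmetic mean: {arithmetic:.2f}")  # 1.75
print(f"geometric mean:  {geometric:.2f}")   # 1.41
```

A 2× speedup and a 2× slowdown also cancel exactly under the geometric mean, which is why it is the usual choice for summarizing ratios across a kernel suite.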

Key Points
  • Compiler written in pure Python (5,000 lines) produces CUDA kernels for TinyLlama and Qwen2.5-7B, with geomean speedups of 1.11× over PyTorch eager and 1.20× over torch.compile on an RTX 5090
  • Achieves up to 4.7× speedup on small reductions, SDPA, and KV-projections by implementing classic GPU optimization steps such as shared memory staging and bank conflict reduction
  • Uses six intermediate representations (from Loop IR to fully scheduled TileOp) and 16 optimization passes, each reproducible via a CLI command
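
To make "bank conflict reduction" concrete, here is a small Python model of one well-known technique: padding a shared-memory tile by one word per row. It assumes 32 banks of 4-byte words and a 32-lane warp, as on current NVIDIA GPUs. The summary does not say which technique the compiler actually uses, so padding is an assumption here, and `max_conflict_degree` is a hypothetical helper.

```python
from collections import Counter

NUM_BANKS = 32  # shared memory banks, each one 4-byte word wide
WARP_SIZE = 32  # lanes that issue a shared-memory access together


def max_conflict_degree(row_stride, column=0):
    """Worst-case number of lanes hitting one bank in a column read.

    Lane i reads tile[i][column] from a row-major tile of 4-byte words
    whose rows are row_stride words apart. Every lane touches a distinct
    address, so lanes that land in the same bank are serialized.
    """
    banks = Counter(
        (lane * row_stride + column) % NUM_BANKS for lane in range(WARP_SIZE)
    )
    return max(banks.values())


# Unpadded 32-wide tile: every lane maps to the same bank.
print(max_conflict_degree(row_stride=32))  # 32

# One word of padding per row spreads the lanes across all banks.
print(max_conflict_degree(row_stride=33))  # 1
```

The padded layout costs one extra word per row of shared memory in exchange for conflict-free column reads, a trade-off that commonly appears when a tile is staged in shared memory and then read in transposed order.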

Why It Matters

The project offers a transparent, hackable alternative to opaque industry compilers, making high-performance GPU kernel optimization easier to study and modify.