New Python compiler emits fused GPU kernels that run 1.20× faster than torch.compile
A 5,000-line pure-Python compiler stack beats torch.compile on an RTX 5090
Deep Dive
A developer built a hackable LLM compiler from scratch that lowers TinyLlama and Qwen2.5-7B through six IRs to CUDA kernels. On an RTX 5090, the emitted FP32 kernels reach a geometric-mean speedup of 1.11× over PyTorch eager and 1.20× over torch.compile, with full-block parity on TinyLlama-128 and Qwen2.5-7B at seq=128. Individual kernels win by up to 4.7× on small reductions, SDPA, and KV-projections. The author documents the process in two parts.
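The headline figures are geometric means over per-kernel speedups, so a single 4.7× win does not dominate the summary the way it would in an arithmetic mean. A minimal sketch of that calculation follows; the kernel names and timings are made-up illustrative values, not measurements from the project.

```python
import math

def geomean(ratios):
    """Geometric mean of positive ratios, computed in log space."""
    if not ratios:
        raise ValueError("need at least one ratio")
    if any(r <= 0 for r in ratios):
        raise ValueError("ratios must be positive")
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Hypothetical (baseline_us, compiled_us) timings; NOT from the source.
timings_us = {
    "rmsnorm": (8.0, 2.0),     # 4.0x faster
    "matmul": (100.0, 125.0),  # 0.8x, i.e. slower than the baseline
    "softmax": (10.0, 8.0),    # 1.25x faster
}

# Speedup = baseline time / compiled time, so values above 1 are wins.
speedups = [base / ours for base, ours in timings_us.values()]
print(f"geomean speedup: {geomean(speedups):.2f}x")
```

With these sample numbers the arithmetic mean would be about 2.02×, while the geometric mean is about 1.59×, which shows why one large win and one regression can still average out to a modest headline figure.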
Key Points
- Compiler written in pure Python (5,000 lines) produces CUDA kernels for TinyLlama and Qwen2.5-7B that outperform PyTorch eager by a geomean of 1.11× and torch.compile by a geomean of 1.20× on an RTX 5090
- Achieves up to 4.7× speedup on small reductions, SDPA, and KV-projections by implementing classic GPU optimization steps like shared memory staging and bank conflict reduction
- Uses six intermediate representations (from Loop IR to fully scheduled TileOp) and 16 optimization passes, each reproducible via a CLI command
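To make "bank conflict reduction" concrete, the sketch below models NVIDIA shared memory as 32 banks of 4-byte words and counts how many threads of a warp hit the same bank when reading one column of a row-major FP32 tile. Padding each row by one element is one common fix; the source does not say which technique the compiler actually uses, so `conflict_degree` and the padding choice are illustrative assumptions.

```python
from collections import Counter

NUM_BANKS = 32  # NVIDIA shared memory: 32 banks, each one 4-byte word wide

def conflict_degree(row_stride, col, warp_size=32):
    """Worst-case number of threads in a warp that map to the same bank
    when thread t reads element [t][col] of a row-major FP32 tile.

    row_stride is the number of floats per row, including any padding.
    """
    banks = Counter((t * row_stride + col) % NUM_BANKS for t in range(warp_size))
    return max(banks.values())

# A 32x32 tile with no padding: every thread in the column read hits one bank.
print(conflict_degree(row_stride=32, col=0))  # 32

# Same tile with one float of padding per row: accesses spread over all banks.
print(conflict_degree(row_stride=33, col=0))  # 1
```

A degree of 32 means the hardware serializes the access into 32 transactions, while a degree of 1 means the whole warp is served in one. The extra column costs 32 floats of shared memory per tile.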
Why It Matters
Democratizes high-performance GPU kernel optimization with a transparent, hackable alternative to opaque industry compilers.