New Python compiler generates fused GPU kernels 20% faster than torch.compile

r/MachineLearning May 12, 2026

⚡ A 5,000-line Python compiler stack beats torch.compile on an RTX 5090

Deep Dive

A developer built a hackable LLM compiler from scratch that lowers TinyLlama and Qwen2.5-7B through six IRs to CUDA kernels. On an RTX 5090 in FP32, the emitted kernels reach a geometric-mean speedup of 1.11× over PyTorch eager and 1.20× over torch.compile, with full-block parity on TinyLlama-128 and Qwen2.5-7B at seq=128. The largest wins, up to 4.7×, come on small reductions, SDPA, and KV-projections. The author documents the process in two parts.
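The "geomean" figures above are geometric means of per-kernel speedup ratios. A minimal sketch of how such a number is typically computed follows; the kernel names and timings are hypothetical placeholders, not measurements from the post, and the post's own benchmarking method may differ.

```python
import math


def geomean(ratios):
    """Geometric mean of positive ratios, computed in log space."""
    if not ratios:
        raise ValueError("need at least one ratio")
    if any(r <= 0 for r in ratios):
        raise ValueError("ratios must be positive")
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Hypothetical (baseline_ms, compiled_ms) pairs; NOT data from the post.
timings_ms = {
    "small_reduction": (0.040, 0.010),  # compiled kernel is 4x faster
    "sdpa": (0.200, 0.100),             # 2x faster
    "mlp_matmul": (1.000, 2.000),       # 2x slower
}

# Speedup = baseline time / compiled time, so values above 1 are wins.
speedups = [base / ours for base, ours in timings_ms.values()]

print(f"geomean speedup: {geomean(speedups):.2f}x")
print(f"arithmetic mean: {sum(speedups) / len(speedups):.2f}x")
```

The geometric mean is the usual choice for averaging ratios because a 2× win and a 2× loss cancel to 1.0×, whereas the arithmetic mean would overstate the result. It also explains how a 4.7× best case can coexist with a 1.11× overall figure: large wins on small kernels are averaged against kernels that run at or below baseline speed.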

Key Points

A compiler written in pure Python (5,000 lines) produces FP32 CUDA kernels for TinyLlama and Qwen2.5-7B that reach a geometric-mean speedup of 1.11× over PyTorch eager and 1.20× over torch.compile on an RTX 5090
Reaches speedups of up to 4.7× on small reductions, SDPA, and KV-projections by applying classic GPU optimizations such as shared memory staging and bank conflict reduction
Uses six intermediate representations (from Loop IR to fully scheduled TileOp) and 16 optimization passes, each reproducible via a CLI command
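
To make "bank conflict reduction" concrete, here is a small Python simulation of shared memory bank mapping. It illustrates row padding, one common way to avoid conflicts; the post does not say which technique the compiler uses, and the function name and tile sizes are assumptions chosen for illustration. It models the 32 four-byte banks found on current NVIDIA GPUs and is not code emitted by the compiler.

```python
from collections import Counter

NUM_BANKS = 32  # shared memory banks, each 4 bytes (one FP32 word) wide


def worst_case_conflict(row_stride, num_threads=32):
    """Simulate one warp reading a column of a row-major FP32 tile.

    Thread t reads element (row=t, col=0), which sits at word index
    t * row_stride. Returns the largest number of threads that land in
    the same bank; 1 means the access is conflict-free.
    """
    banks = Counter((t * row_stride) % NUM_BANKS for t in range(num_threads))
    return max(banks.values())


# A 32x32 tile: every row starts in bank 0, so a column read serializes.
unpadded = worst_case_conflict(row_stride=32)

# A 32x33 tile (one padding column): consecutive rows shift by one bank.
padded = worst_case_conflict(row_stride=33)

print(f"32-wide rows: {unpadded}-way conflict")
print(f"33-wide rows: {padded}-way conflict")
```

With a row stride of 32 words, all 32 threads map to the same bank and the hardware serializes the access. Padding each row by one word spreads the same column across all 32 banks, at the cost of a little unused shared memory.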

Why It Matters

Makes high-performance GPU kernel optimization more approachable by offering a small, transparent, hackable alternative to larger, harder-to-inspect production compilers.
