A Hackable ML Compiler Stack in 5,000 Lines of Python
A hackable ML compiler that emits raw CUDA and teaches compiler design.
The modern ML compiler ecosystem is brutally complex: TVM alone is over 500,000 lines of C++, and PyTorch layers Dynamo, Inductor, and Triton on top of one another, while XLA, MLIR, Halide, and Mojo add still more surface area to learn. Against this backdrop, a developer built a reference ML compiler stack in just 5,000 lines of pure Python that emits raw CUDA kernels. The compiler accepts small models (TinyLlama, Qwen2.5-7B) and lowers them through six distinct intermediate representations (IRs), each progressively closer to hardware. For example, a simple `torch.relu(torch.matmul(x + bias, w))` is first captured as Torch IR (a 1:1 mirror of PyTorch ops), then decomposed into Tensor IR (elementwise/reduction operations), fused into a single loop nest in Loop IR, and finally scheduled onto threads and blocks in Tile IR, with optimizations such as shared-memory tiling, register tiling, and asynchronous memory transfers via `cp.async`.
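The Torch IR → Tensor IR step can be pictured as a post-order graph walk that expands each framework-level op into primitive elementwise/reduction ops. The sketch below is hypothetical (the `Node`, `RULES`, and `lower` names are illustrative, not the project's actual classes), but it shows the shape of such a decomposition pass:

```python
import itertools
from dataclasses import dataclass

# Illustrative sketch only: a toy "Torch IR" node and a decomposition
# table mapping each Torch-level op to primitive Tensor IR ops.
@dataclass
class Node:
    op: str
    args: tuple

# Torch IR: a 1:1 mirror of torch.relu(torch.matmul(x + bias, w))
graph = Node("relu", (Node("matmul", (Node("add", ("x", "bias")), "w")),))

_ids = itertools.count(1)
def fresh():
    """Generate a fresh temporary name (t1, t2, ...)."""
    return f"t{next(_ids)}"

# Each Torch op expands into one or more primitives; matmul becomes
# an elementwise multiply followed by a reduction over k.
RULES = {
    "add":    lambda a, b: [("elementwise", f"{a} + {b}")],
    "relu":   lambda a:    [("elementwise", f"max({a}, 0)")],
    "matmul": lambda a, b: [("elementwise", f"{a}[i,k] * {b}[k,j]"),
                            ("reduce_sum", "over k")],
}

def lower(node, out):
    """Post-order walk: lower operands first, then decompose this op."""
    names = [lower(a, out) if isinstance(a, Node) else a for a in node.args]
    for kind, expr in RULES[node.op](*names):
        name = fresh()
        out.append((name, kind, expr))
    return name  # name of the op's final result

tensor_ir = []
lower(graph, tensor_ir)
for name, kind, expr in tensor_ir:
    print(name, "=", kind, ":", expr)
```

Running this prints four Tensor IR primitives (add, multiply, reduce, relu) in dependency order, which is the flat form the downstream fusion pass would consume.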
The pipeline is designed for education and hackability, not to beat Triton on performance. Every stage produces real, readable CUDA code, and the six-IR structure is minimal enough that a developer can trace an entire compilation step by step. Key design choices include a minimal unified op surface (Tensor IR) that lets future frontends (ONNX, JAX) plug in without touching downstream passes, and aggressive fusion that avoids materializing intermediate tensors (e.g., a (16,64,16) intermediate is never actually created). The compiler's code and accompanying article are publicly available on GitHub (repository 'deplodock'), providing a rare, self-contained tutorial on ML compiler design from scratch.
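What fusion buys is easiest to see in plain Python. In this hedged sketch (a hand-written stand-in, not the compiler's generated code), the `x + bias` intermediate is folded into the matmul loop body and the relu into its epilogue, so no intermediate tensor is ever allocated:

```python
import numpy as np

def fused(x, bias, w):
    """Compute relu((x + bias) @ w) as one loop nest, with no
    intermediate tensor for (x + bias) or the matmul output."""
    B, K = x.shape
    _, N = w.shape
    out = np.zeros((B, N), dtype=np.float64)
    for i in range(B):
        for j in range(N):
            acc = 0.0
            for k in range(K):
                # the elementwise add is fused into the reduction body
                acc += (x[i, k] + bias[k]) * w[k, j]
            # the relu is fused into the epilogue of the reduction
            out[i, j] = max(acc, 0.0)
    return out

x = np.random.rand(4, 8)
bias = np.random.rand(8)
w = np.random.rand(8, 3)

# Matches the unfused reference, which materializes both intermediates
ref = np.maximum((x + bias) @ w, 0.0)
assert np.allclose(fused(x, bias, w), ref)
```

The same transformation, applied at Loop IR level and then tiled for the GPU, is what lets the compiler skip materializing intermediates like the (16,64,16) tensor mentioned above.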
- Distills what takes TVM 500K+ lines of C++ into just 5,000 lines of pure Python that emit raw CUDA.
- Uses six intermediate representations (IRs) from Torch IR to Tile IR, including GPU-aware loop fusion and tiling.
- Designed as a hackable, educational reference—not a production competitor—with full step-by-step traceability.
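Because the backend emits plain CUDA source text, the final stage can be as simple as filling a template with the schedule's tile parameters. The following is a speculative miniature (the function name and template are invented for illustration, not taken from the repository):

```python
def emit_relu_kernel(block: int = 256) -> str:
    """Render a trivial elementwise CUDA kernel as source text,
    with the block size chosen by the (hypothetical) Tile IR schedule."""
    return f'''
extern "C" __global__ void relu_kernel(const float* x, float* y, int n) {{
    int i = blockIdx.x * {block} + threadIdx.x;
    if (i < n) y[i] = fmaxf(x[i], 0.0f);
}}
'''.strip()

src = emit_relu_kernel()
print(src)  # readable CUDA, ready to compile with nvcc or load via a driver API
```

Real kernels in the stack are of course far richer (shared-memory tiles, register tiles, `cp.async` pipelines), but the principle is the same: every scheduling decision ends up as concrete, inspectable CUDA text.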
Why It Matters
Democratizes understanding of ML compilers; a teachable reference for engineers and researchers tackling model acceleration.