Kernel Foundry: AI optimizer reaches up to 100% GPU kernel correctness via evolutionary search
New framework uses LLMs and multi-island search to auto-optimize GPU kernels.
Generating high-performance GPU kernels is notoriously difficult, requiring both algorithmic correctness and hardware-aware optimization. While LLMs show promise in code generation, they often produce kernels that are either correct but slow or fast but buggy. Kernel Foundry, developed by Zixuan Huang and colleagues, tackles this by combining LLM-generated code with a diagnosis-driven evolutionary search. The framework uses retrieval-augmented generation to produce initial candidate kernels, then refines them across multiple "islands" (parallel subpopulations) using structured error feedback. A centralized experience library accumulates optimization strategies, so islands do not repeat work that has already been tried. Crucially, the system includes explicit mechanisms to prevent cheating, ensuring that performance gains come from genuine kernel improvements rather than shortcuts such as skipping computations.
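To make the multi-island idea concrete, here is a minimal Python sketch of an island-model search with a shared library of past results. It is an illustration under stated assumptions, not Kernel Foundry's implementation: a "kernel" is reduced to a hypothetical (tile, unroll) pair, `evaluate` stands in for compiling and timing a kernel, and `mutate` stands in for an LLM rewrite guided by diagnostic feedback. All names and parameters are invented for this example.

```python
import random

# Hypothetical stand-in for "compile, verify, and time a kernel".
# A candidate is a (tile, unroll) pair; lower cost is better.
def evaluate(candidate):
    tile, unroll = candidate
    return abs(tile - 64) + 4 * abs(unroll - 4)

# Hypothetical stand-in for an LLM rewrite guided by error feedback.
def mutate(candidate, rng):
    tile, unroll = candidate
    tile = max(8, tile + rng.choice([-16, -8, 8, 16]))
    unroll = max(1, unroll + rng.choice([-1, 0, 1]))
    return (tile, unroll)

def island_search(num_islands=4, pop_size=6, generations=30,
                  migrate_every=5, seed=0):
    rng = random.Random(seed)
    # Each island starts from its own random subpopulation.
    islands = [
        [(rng.choice([8, 16, 32, 128, 256]), rng.randint(1, 8))
         for _ in range(pop_size)]
        for _ in range(num_islands)
    ]
    library = {}  # shared experience library: candidate -> measured cost

    for gen in range(1, generations + 1):
        for i, pop in enumerate(islands):
            children = [mutate(rng.choice(pop), rng) for _ in range(pop_size)]
            for cand in pop + children:
                if cand not in library:  # never re-evaluate a known candidate
                    library[cand] = evaluate(cand)
            # Keep the best distinct candidates on this island.
            islands[i] = sorted(set(pop + children), key=library.get)[:pop_size]
        if gen % migrate_every == 0:
            # Ring migration: each island receives its neighbour's best.
            bests = [pop[0] for pop in islands]
            for i, pop in enumerate(islands):
                pop.append(bests[i - 1])

    best = min(library, key=library.get)
    return best, library[best], library

if __name__ == "__main__":
    best, cost, library = island_search()
    print("best candidate:", best, "cost:", cost,
          "evaluations:", len(library))
```

The islands evolve independently between migrations, which keeps the subpopulations diverse, while the shared dictionary plays the role of the experience library by ensuring no candidate is evaluated twice.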
On the KernelBench benchmark, Kernel Foundry achieves up to 100% correctness on Level 2 tasks (fused operator sequences, which sit between Level 1's single operators and Level 3's full model architectures), significantly outperforming existing baselines like standalone LLMs and traditional auto-tuning. The multi-expert approach allows the system to handle diverse kernel types, from matrix operations to custom CUDA kernels, without manual intervention. This work bridges the gap between large language models and evolutionary optimization, offering a scalable solution for automatic kernel generation. Future applications could extend to custom accelerators, edge devices, and domain-specific processors, reducing the need for expert human tuning in high-performance computing.
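The article does not spell out how correctness is checked or how cheating is detected. One common approach, sketched below in Python as an assumption rather than as Kernel Foundry's actual mechanism, is to compare each candidate against a trusted reference on freshly generated random inputs of varying shapes, so that a candidate which skips the computation or returns a constant cannot pass. The function names (`passes_check`, `honest`, `cheater`) are hypothetical, and plain Python lists stand in for GPU tensors.

```python
import random

def reference_matmul(a, b):
    # Trusted, deliberately simple reference implementation.
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def matches(got, expected, tol):
    # Shape and value comparison against the reference output.
    if len(got) != len(expected):
        return False
    for got_row, exp_row in zip(got, expected):
        if len(got_row) != len(exp_row):
            return False
        if any(abs(g - e) > tol for g, e in zip(got_row, exp_row)):
            return False
    return True

def passes_check(candidate, trials=5, tol=1e-9, seed=0):
    rng = random.Random(seed)
    for _ in range(trials):
        # Fresh shapes and values on every trial defeat cached or
        # constant outputs.
        n, k, m = (rng.randint(1, 6) for _ in range(3))
        a, b = random_matrix(n, k, rng), random_matrix(k, m, rng)
        expected = reference_matmul(a, b)
        try:
            if not matches(candidate(a, b), expected, tol):
                return False
        except Exception:
            return False  # a crashing candidate counts as incorrect
    return True

def honest(a, b):
    # A genuine variant: transposes b once for row-wise access.
    bt = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def cheater(a, b):
    # "Fast" because it skips the computation entirely.
    return [[0.0] * len(b[0]) for _ in a]

if __name__ == "__main__":
    print("honest passes:", passes_check(honest))
    print("cheater passes:", passes_check(cheater))
```

In a pipeline of this kind, timing would only be recorded for candidates that pass the check, so a speedup cannot be credited to a kernel that produces wrong results.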
- Kernel Foundry combines LLM-based initialization with multi-island evolutionary search to generate GPU kernels.
- Achieves up to 100% correctness on KernelBench Level 2 (fused operator sequences).
- Includes a centralized experience library and anti-cheating mechanisms to ensure genuine performance gains.
Why It Matters
Automates GPU kernel optimization, saving developer hours and enabling faster AI and HPC workloads.