KernelSkill: A Multi-Agent Framework for GPU Kernel Optimization
New AI framework replaces trial-and-error LLM optimization with coordinated expert agents and dual memory.
A research team from Beihang University and Tsinghua University has introduced KernelSkill, a novel multi-agent framework designed to optimize GPU kernels, the core computational units that power AI training and inference. The system addresses a key limitation of current LLM-based optimization methods: they rely on opaque, learned heuristics, which leads to inefficient trial-and-error search and poor interpretability. Instead, KernelSkill employs a team of coordinated AI agents equipped with explicit, knowledge-driven expert skills for tasks such as loop unrolling and memory access optimization.
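To make the idea concrete, here is a minimal sketch of what an explicit, knowledge-driven "skill" could look like. This is an illustration only: the `Skill` class, `can_unroll`, and `unroll` names are hypothetical and do not reflect KernelSkill's actual API; the paper's skill representation is not detailed here.

```python
# Hypothetical sketch: an optimization skill as explicit, inspectable knowledge
# rather than an opaque learned heuristic. Names are assumptions, not KernelSkill's API.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Skill:
    name: str                          # human-readable, e.g. "loop_unrolling"
    rationale: str                     # why it helps -- this is what makes the process interpretable
    applies_to: Callable[[str], bool]  # precondition on the kernel source
    apply: Callable[[str], str]        # source-to-source transformation

def can_unroll(src: str) -> bool:
    # Applicable when the kernel has an inner for-loop without an unroll hint.
    return "for (" in src and "#pragma unroll" not in src

def unroll(src: str) -> str:
    # Ask the compiler to unroll loops, trading code size for fewer branches
    # and more instruction-level parallelism.
    return src.replace("for (", "#pragma unroll\n    for (")

loop_unrolling = Skill(
    name="loop_unrolling",
    rationale="Removes loop overhead and exposes instruction-level parallelism.",
    applies_to=can_unroll,
    apply=unroll,
)

kernel = "for (int i = 0; i < N; ++i) acc += x[i] * y[i];"
if loop_unrolling.applies_to(kernel):
    print(loop_unrolling.apply(kernel))
```

Because each skill carries an explicit precondition and rationale, an agent can report not just what it changed but why, which is the interpretability gain the authors emphasize.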
KernelSkill's architecture features a dual-level memory system: long-term memory stores reusable optimization skills, while short-term memory keeps agents from repeatedly backtracking during the search for optimal code. This structure lets the framework systematically apply proven techniques rather than guess. In benchmark tests on KernelBench Levels 1-3, KernelSkill achieved a perfect 100% success rate and delivered average speedups of 5.44x, 2.82x, and 1.92x, respectively, over the standard Torch Eager baseline, outperforming all previous methods.
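A rough sketch of the dual-level memory idea follows, under stated assumptions: long-term memory indexes skills by the kernel patterns they worked on before, and short-term memory fingerprints variants already tried in the current search so agents skip them. The class and method names are invented for illustration, not taken from the paper's implementation.

```python
# Hypothetical sketch of dual-level memory; class/method names are assumptions.
import hashlib

class LongTermMemory:
    """Reusable skills, keyed by the kernel pattern they proved effective on."""
    def __init__(self):
        self._skills: dict[str, list] = {}

    def record(self, pattern: str, skill) -> None:
        # Persist a skill that improved performance, for reuse on similar kernels.
        self._skills.setdefault(pattern, []).append(skill)

    def retrieve(self, pattern: str) -> list:
        return self._skills.get(pattern, [])

class ShortTermMemory:
    """Fingerprints of kernel variants tried in the current search episode."""
    def __init__(self):
        self._seen: set[str] = set()

    def already_tried(self, kernel_src: str) -> bool:
        # Hash the candidate source; a repeat fingerprint means the search is
        # backtracking to a variant it has already evaluated.
        fp = hashlib.sha256(kernel_src.encode()).hexdigest()
        if fp in self._seen:
            return True
        self._seen.add(fp)
        return False

stm = ShortTermMemory()
assert not stm.already_tried("kernel v1")  # first attempt proceeds
assert stm.already_tried("kernel v1")      # repeat attempt is skipped
```

Splitting the two memories this way separates cross-task knowledge (what tends to work) from per-task search state (what has already been tried), which is what allows the skills to accumulate while the search stays non-repetitive.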
The framework represents a shift toward more transparent and efficient AI-for-AI development tools. By making the optimization process knowledge-driven and interpretable, it provides clearer insights into why certain code changes improve performance. The code has been made publicly available, allowing developers and researchers to apply this multi-agent approach to accelerate their own GPU-intensive workloads, from large language model training to scientific computing simulations.
- Replaces opaque LLM heuristics with explicit, knowledge-driven expert skills for GPU kernel optimization
- Uses a multi-agent framework with dual-level memory (long-term for skills, short-term to prevent backtracking)
- Achieved 100% success rate and up to 5.44x speedup on KernelBench, outperforming prior baselines
Why It Matters
Faster GPU kernels directly accelerate AI training and inference workloads, reducing computational costs and energy consumption for developers.