KernelSkill: A Multi-Agent Framework for GPU Kernel Optimization
New AI framework replaces trial-and-error LLM optimization with coordinated expert agents and dual memory.
A research team from Beihang University and Tsinghua University has introduced KernelSkill, a novel multi-agent framework designed to optimize GPU kernels, the core computational units that power AI training and inference. The system addresses a key limitation of current LLM-based optimization methods: they rely on opaque, learned heuristics, which leads to inefficient trial-and-error search and poor interpretability. Instead, KernelSkill employs a team of coordinated AI agents equipped with explicit, knowledge-driven expert skills for tasks such as loop unrolling and memory access optimization.
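To make the idea concrete, here is a minimal sketch of what an explicit, knowledge-driven "skill" could look like. This is an illustration only: the `Skill` class, `can_unroll`, and `unroll` names are hypothetical and do not reflect KernelSkill's actual API; the paper's skill representation is not detailed here.

```python
# Hypothetical sketch: an optimization skill as explicit, inspectable knowledge
# rather than an opaque learned heuristic. Names are assumptions, not KernelSkill's API.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Skill:
    name: str                          # human-readable, e.g. "loop_unrolling"
    rationale: str                     # why it helps -- this is what makes the process interpretable
    applies_to: Callable[[str], bool]  # precondition on the kernel source
    apply: Callable[[str], str]        # source-to-source transformation

def can_unroll(src: str) -> bool:
    # Applicable when the kernel has an inner for-loop without an unroll hint.
    return "for (" in src and "#pragma unroll" not in src

def unroll(src: str) -> str:
    # Ask the compiler to unroll loops, trading code size for fewer branches
    # and more instruction-level parallelism.
    return src.replace("for (", "#pragma unroll\n    for (")

loop_unrolling = Skill(
    name="loop_unrolling",
    rationale="Removes loop overhead and exposes instruction-level parallelism.",
    applies_to=can_unroll,
    apply=unroll,
)

kernel = "for (int i = 0; i < N; ++i) acc += x[i] * y[i];"
if loop_unrolling.applies_to(kernel):
    print(loop_unrolling.apply(kernel))
```

Because each skill carries an explicit precondition and rationale, an agent can report not just what it changed but why, which is the interpretability gain the authors emphasize.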
KernelSkill's architecture features a dual-level memory system: long-term memory stores reusable optimization skills, while short-term memory keeps agents from repeatedly backtracking during the search for optimal code. This structure lets the framework systematically apply proven techniques rather than guess. In benchmark tests on KernelBench Levels 1-3, KernelSkill achieved a perfect 100% success rate and delivered average speedups of 5.44x, 2.82x, and 1.92x, respectively, over the standard Torch Eager baseline, outperforming all previous methods.
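A rough sketch of the dual-level memory idea follows, under stated assumptions: long-term memory indexes skills by the kernel patterns they worked on before, and short-term memory fingerprints variants already tried in the current search so agents skip them. The class and method names are invented for illustration, not taken from the paper's implementation.

```python
# Hypothetical sketch of dual-level memory; class/method names are assumptions.
import hashlib

class LongTermMemory:
    """Reusable skills, keyed by the kernel pattern they proved effective on."""
    def __init__(self):
        self._skills: dict[str, list] = {}

    def record(self, pattern: str, skill) -> None:
        # Persist a skill that improved performance, for reuse on similar kernels.
        self._skills.setdefault(pattern, []).append(skill)

    def retrieve(self, pattern: str) -> list:
        return self._skills.get(pattern, [])

class ShortTermMemory:
    """Fingerprints of kernel variants tried in the current search episode."""
    def __init__(self):
        self._seen: set[str] = set()

    def already_tried(self, kernel_src: str) -> bool:
        # Hash the candidate source; a repeat fingerprint means the search is
        # backtracking to a variant it has already evaluated.
        fp = hashlib.sha256(kernel_src.encode()).hexdigest()
        if fp in self._seen:
            return True
        self._seen.add(fp)
        return False

stm = ShortTermMemory()
assert not stm.already_tried("kernel v1")  # first attempt proceeds
assert stm.already_tried("kernel v1")      # repeat attempt is skipped
```

Splitting the two memories this way separates cross-task knowledge (what tends to work) from per-task search state (what has already been tried), which is what allows the skills to accumulate while the search stays non-repetitive.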
The framework represents a shift toward more transparent and efficient AI-for-AI development tools. By making the optimization process knowledge-driven and interpretable, it provides clearer insights into why certain code changes improve performance. The code has been made publicly available, allowing developers and researchers to apply this multi-agent approach to accelerate their own GPU-intensive workloads, from large language model training to scientific computing simulations.
- Replaces opaque LLM heuristics with explicit, knowledge-driven expert skills for GPU kernel optimization
- Uses a multi-agent framework with dual-level memory (long-term for skills, short-term to prevent backtracking)
- Achieved 100% success rate and up to 5.44x speedup on KernelBench, outperforming prior baselines
Why It Matters
Faster GPU kernels directly accelerate AI training and inference workloads, reducing computational costs and energy consumption for developers.