KernelAgent: Hardware-Guided GPU Kernel Optimization via Multi-Agent Orchestration
The open-source system achieves 89% of H100 roofline efficiency and a 2.02x speedup over its previously generated kernels.
The PyTorch team has open-sourced KernelAgent, a multi-agent AI system that automates GPU kernel optimization by integrating real hardware performance signals into a closed-loop workflow. Building on their previous correctness-focused system, which achieved 100% accuracy on 250 benchmark tasks, this version adds hardware-guided optimization: AI agents profile kernels, diagnose bottlenecks using NVIDIA Nsight Compute metrics, and iteratively apply targeted optimizations. The system specifically targets forward-pass inference kernels, where latency directly impacts serving costs, automating work that previously took kernel engineers days or weeks.
KernelAgent operates through a coordinated team of specialized agents: ProfilerAgent collects hardware signals, JudgeAgent diagnoses bottlenecks, AnalyzeAgent prescribes recommendations, and an Optimization Manager explores different strategies in parallel. This multi-agent approach lets the system evaluate multiple optimization paths concurrently while learning from each iteration through shared memory. In evaluations across 100 L1 KernelBench tasks, KernelAgent generated kernels that achieved a 2.02x speedup over those produced by the previous system and outperformed default torch.compile in 65 of 100 tasks, reaching 89% of the theoretical hardware roofline efficiency on NVIDIA H100 GPUs. The open-source release includes the complete optimization codebase and documentation, enabling developers to automate kernel optimization for their own AI workloads.
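To make the agent roles concrete, here is a minimal sketch of the closed loop described above. The article does not specify the agents' interfaces, so everything below beyond the role names (ProfilerAgent, JudgeAgent, AnalyzeAgent, Optimization Manager) — function signatures, metric names, strategy names, and the mock speedup numbers — is an illustrative assumption, not KernelAgent's actual API.

```python
# Hypothetical sketch of KernelAgent's closed loop. All function
# names, metrics, and strategies are illustrative assumptions; only
# the agent roles come from the article.

def profiler_agent(kernel):
    """Stand-in for Nsight Compute profiling: return mock metrics."""
    return {"latency_us": kernel["latency_us"], "dram_util": kernel["dram_util"]}

def judge_agent(metrics):
    """Diagnose the dominant bottleneck from hardware metrics."""
    return "memory_bound" if metrics["dram_util"] > 0.6 else "compute_bound"

def analyze_agent(bottleneck):
    """Prescribe candidate optimization strategies for the bottleneck."""
    strategies = {
        "memory_bound": ["vectorize_loads", "shared_memory_tiling"],
        "compute_bound": ["tensor_core_mma", "loop_unroll"],
    }
    return strategies[bottleneck]

def apply_strategy(kernel, strategy):
    """Mock code transformation: each strategy shaves some latency."""
    gains = {"vectorize_loads": 0.85, "shared_memory_tiling": 0.7,
             "tensor_core_mma": 0.6, "loop_unroll": 0.9}
    new = dict(kernel)
    new["latency_us"] *= gains[strategy]
    return new

def optimization_manager(kernel, max_iters=3):
    """Closed loop: profile -> diagnose -> prescribe -> try candidate
    strategies in parallel -> keep the fastest variant -> repeat."""
    best = kernel
    for _ in range(max_iters):
        metrics = profiler_agent(best)
        bottleneck = judge_agent(metrics)
        candidates = [apply_strategy(best, s) for s in analyze_agent(bottleneck)]
        winner = min(candidates, key=lambda k: k["latency_us"])
        if winner["latency_us"] >= best["latency_us"]:
            break  # no further improvement; stop iterating
        best = winner
    return best

baseline = {"latency_us": 100.0, "dram_util": 0.8}
optimized = optimization_manager(baseline)
print(optimized["latency_us"] < baseline["latency_us"])  # True
```

In the real system each candidate strategy would be compiled and profiled on hardware; the loop structure — measure, diagnose, transform, keep the winner — is the part this sketch mirrors.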
- Achieved a 2.02x speedup over previously generated kernels and a 1.56x speedup versus default torch.compile
- Reached 89% of NVIDIA H100 hardware roofline efficiency across 100 L1 benchmark tasks
- Open-source multi-agent system automates what previously required weeks of manual kernel engineering work
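The 89% figure above is a fraction of the roofline bound: the best performance a kernel can attain given its arithmetic intensity and the machine's peak compute and memory bandwidth. A minimal sketch of that calculation follows; the H100-class hardware numbers are illustrative, not values taken from the article.

```python
# Sketch of the roofline model used to judge kernel efficiency.
# Hardware figures below are illustrative H100-class numbers.

def roofline_bound_flops(arith_intensity, peak_flops, peak_bw):
    """Attainable FLOP/s = min(peak compute, bandwidth * intensity)."""
    return min(peak_flops, peak_bw * arith_intensity)

PEAK_FLOPS = 989e12   # ~989 TFLOP/s (FP16 tensor core, illustrative)
PEAK_BW = 3.35e12     # ~3.35 TB/s HBM3 bandwidth (illustrative)

# A memory-bound kernel performing 2 FLOPs per byte moved:
intensity = 2.0
bound = roofline_bound_flops(intensity, PEAK_FLOPS, PEAK_BW)

# A kernel measured at 89% of its roofline bound:
achieved = 0.89 * bound
print(f"roofline efficiency: {achieved / bound:.0%}")
```

At low arithmetic intensity the bound is set by memory bandwidth, which is why the agents' diagnosis step distinguishes memory-bound from compute-bound kernels before prescribing optimizations.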
Why It Matters
Dramatically reduces AI inference costs by automating GPU kernel optimization, potentially cutting weeks of engineering time per kernel.