Developer Tools

Kerncap: Automated Kernel Extraction and Isolation for AMD GPUs

Isolate, edit, and recompile GPU kernels in minutes instead of hours.

Deep Dive

Kerncap, developed by Cole Ramos and Keith Lowery, tackles a core bottleneck of iterative GPU kernel tuning: developers often must rebuild an entire application to test a small kernel change. Kerncap instead isolates kernels automatically by intercepting dispatches at the HSA runtime layer. It works with both HIP and Triton, capturing metadata via a lightweight Python compile-hook shim. The tool performs an address-space closure of all device memory, a virtual-address-faithful snapshot that preserves embedded device pointers without requiring DWARF metadata or pointer chasing. It then emits self-contained reproducer projects: HIP reproducers use a Clang VFS overlay for source-level recompilation, while Triton reproducers are tuning-pinned to preserve the JIT kernel's numerical contract.
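The virtual-address-faithful idea can be sketched in miniature. In this hypothetical Python model (all names illustrative; the real tool snapshots GPU device memory), an address space is a map from virtual base addresses to byte regions. Because snapshot and restore preserve each region's original virtual address, a pointer embedded inside one buffer still resolves correctly after restore, with no debug metadata or pointer rewriting:

```python
import struct

# Toy model of a VA-faithful snapshot (illustrative only): device memory is
# {virtual_base: bytearray}. Buffer A at 0x1000 stores an 8-byte "device
# pointer" to a payload living in buffer B at 0x2000.

def snapshot(address_space):
    """Copy every region verbatim, keyed by its original virtual base."""
    return {base: bytes(buf) for base, buf in address_space.items()}

def restore(snap):
    """Recreate the address space at the same virtual addresses."""
    return {base: bytearray(buf) for base, buf in snap.items()}

def read_u64(address_space, addr):
    """Dereference an 8-byte little-endian value at a virtual address."""
    for base, buf in address_space.items():
        if base <= addr < base + len(buf):
            return struct.unpack_from("<Q", buf, addr - base)[0]
    raise ValueError(f"unmapped address {addr:#x}")

space = {
    0x2000: bytearray(struct.pack("<Q", 0xDEADBEEF)),  # payload
    0x1000: bytearray(struct.pack("<Q", 0x2000)),      # pointer to payload
}

restored = restore(snapshot(space))
ptr = read_u64(restored, 0x1000)   # embedded pointer survives unchanged
payload = read_u64(restored, ptr)  # and still resolves to the payload
```

Because addresses never move, no per-pointer fixups are needed, which is what lets the approach skip DWARF metadata entirely.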

Across six workloads spanning HPC and ML domains on three AMD GPU architectures (CDNA2, CDNA3, RDNA3), Kerncap successfully extracted kernels from snapshots ranging from 152 MB to 30 GB. A notable achievement was capturing vLLM's Mixture-of-Experts weight pool through pointer indirection. In a vLLM case study, the edit-recompile-validate loop achieved a 13.6x speedup over the traditional workflow, reducing isolation from an hours-long manual process to a single command. The resulting reproducers also serve as a substrate for autotuning agents and LLM-driven kernel generators.

Key Points
  • Intercepts GPU dispatches at the HSA runtime for both HIP and Triton kernels.
  • Creates VA-faithful snapshots via address-space closure without DWARF metadata.
  • Achieves 13.6x speedup on a vLLM Mixture-of-Experts workload, reducing isolation to one command.
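The interception idea in the first key point can be modeled as a shim that wraps the dispatch entry point (a hypothetical Python stand-in; the real tool hooks the HSA runtime in native code, and every name below is illustrative). The shim records kernel metadata for the reproducer and then forwards the call unchanged:

```python
import functools

# Hypothetical model of dispatch interception: a shim wraps the launch
# entry point, records the kernel name and argument metadata, then
# forwards the dispatch untouched so application behavior is preserved.

captured = []  # metadata log consumed later by the reproducer emitter

def intercept_dispatch(launch_fn):
    @functools.wraps(launch_fn)
    def shim(kernel_name, *buffers):
        captured.append({
            "kernel": kernel_name,
            "arg_sizes": [len(b) for b in buffers],
        })
        return launch_fn(kernel_name, *buffers)  # forward unchanged
    return shim

@intercept_dispatch
def launch(kernel_name, *buffers):
    # Stand-in for the real runtime dispatch call.
    return f"dispatched {kernel_name} with {len(buffers)} buffer(s)"

result = launch("moe_gemm", b"\x00" * 16, b"\x00" * 8)
```

Because the shim only observes and forwards, the application runs normally while every dispatch is logged for later isolation.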

Why It Matters

Speeds GPU kernel development by automating isolation, enabling rapid iteration for AI/ML workloads on AMD hardware.