kernel-anvil: 2x decode speedup on AMD by auto-tuning llama.cpp kernels per model shape
Open-source tool profiles GGUF model shapes to generate optimal kernel configs, no recompilation needed.
A new open-source tool called kernel-anvil delivers a major performance win for running LLMs on AMD GPUs. Created by developer Apollosenvy, it targets a bottleneck in llama.cpp's MMVQ kernels, which use identical thread/block configurations for every layer regardless of its shape. This one-size-fits-all approach leaves significant performance on the table, especially on the RDNA3 architecture. Kernel-anvil fixes it by reading a GGUF model file, identifying every unique GEMV (general matrix-vector multiplication) shape, profiling each one on the user's actual GPU, and generating a JSON config file with optimal settings for `nwarps` and `rows_per_block`.
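The sweep described above can be sketched in a few lines of Python. This is an illustrative reconstruction, not kernel-anvil's actual code: the candidate values, function names, and the toy cost model standing in for real GPU kernel timing are all assumptions.

```python
# Hypothetical sketch of kernel-anvil's auto-tuning loop. The candidate
# parameter sets and the `benchmark` stand-in are illustrative; the real
# tool times actual MMVQ kernel launches on the GPU.
import json

# Assumed candidate launch parameters to sweep (not from the article).
NWARPS_CANDIDATES = (1, 2, 4, 8)
ROWS_PER_BLOCK_CANDIDATES = (1, 2, 4)

def autotune(shapes, benchmark):
    """For each unique GEMV shape, time every (nwarps, rows_per_block)
    combination and keep the fastest. `benchmark` stands in for a real
    GPU kernel-timing call."""
    config = {}
    for rows, cols in shapes:
        _, nw, rpb = min(
            (benchmark(rows, cols, nw, rpb), nw, rpb)
            for nw in NWARPS_CANDIDATES
            for rpb in ROWS_PER_BLOCK_CANDIDATES
        )
        config[f"{rows}x{cols}"] = {"nwarps": nw, "rows_per_block": rpb}
    return config

# Toy cost model used as a placeholder for real measurements: it trades
# off per-block work against the number of blocks launched.
def toy_cost(rows, cols, nwarps, rows_per_block):
    work_per_block = (cols / nwarps) * rows_per_block
    blocks = rows / rows_per_block
    return work_per_block + 0.1 * blocks

if __name__ == "__main__":
    # Example GEMV shapes of the kind a large transformer produces.
    shapes = [(4096, 4096), (4096, 11008), (32000, 4096)]
    print(json.dumps(autotune(shapes, toy_cost), indent=2))
```

In the real tool, `benchmark` would launch and time the quantized matrix-vector kernel for each configuration, so the resulting JSON reflects measurements on the user's own GPU rather than a model.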
The tool requires a small patch (~50 lines) to llama.cpp's `mmvq.cu` so the kernel launcher reads the config at startup; no model recompilation is needed. The result is dramatic speedups: on an AMD Radeon 7900 XTX, Qwen3.5-27B in Q4_K_M quantization jumped from 12 to 27 tokens per second, a 2.25x improvement, with individual kernel speedups ranging from 1.2x to 2.1x depending on the shape. The entire profiling and optimization sweep takes under a second. The tool currently supports RDNA3 GPUs (the 7900/7800 series), with CUDA and Metal support planned. That makes it the first kernel-optimization tool of its kind built for AMD; previous efforts such as KernelSkill and CUDA Agent focused exclusively on NVIDIA hardware.
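A generated config might look something like the fragment below. The exact schema is a guess; only the `nwarps` and `rows_per_block` keys come from the article, and the shape keys and metadata field are illustrative.

```json
{
  "gpu": "AMD Radeon 7900 XTX",
  "shapes": {
    "4096x4096":  { "nwarps": 4, "rows_per_block": 2 },
    "4096x11008": { "nwarps": 8, "rows_per_block": 1 },
    "32000x4096": { "nwarps": 2, "rows_per_block": 4 }
  }
}
```

The patched `mmvq.cu` would look up the entry matching each GEMV's dimensions at launch time and fall back to llama.cpp's defaults for shapes not present in the file.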
- Achieves 2.25x decode speedup on AMD RDNA3, e.g., Qwen3.5-27B from 12 to 27 tok/s on a 7900 XTX.
- Solves llama.cpp's generic kernel configs by auto-tuning per-shape parameters (nwarps, rows_per_block) in <1 second.
- Requires only a pip install, a one-time model profile command, and a small ~50-line patch to llama.cpp.
Why It Matters
Dramatically lowers the cost and barrier to high-performance LLM inference on consumer AMD GPUs, challenging NVIDIA's dominance.