llama.cpp b9158 adds RDNA3 support with faster AMD transpose
New release brings the mma flash attention kernel to RDNA3 and tunes kernel parameters for AMD GPUs.
ggml-org's llama.cpp, the popular open-source LLM inference engine, released version b9158 on May 14. The update focuses on AMD GPU optimization: it adds RDNA3 support to the CUDA mma FA (flash attention) kernel, enabling tensor core use with FP16 accumulation. Specifically, the VKQ tiles must be 32 logical units long in the attention head direction for head sizes 80 and 112; otherwise, the regular length of 16 is used with FP32 accumulation. The change also enables more efficient data transposition at warp size 32, although it scrambles the accumulators along the head dimension, which a new data_layout entry handles. Kernel parameters were tuned for RDNA3, RDNA4, and CDNA1 GPUs, and CDNA now supports head sizes up to 256.
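To make the tile lengths concrete, the short sketch below works out how many tiles of 16 or 32 logical units are needed to cover a given head size. It is plain arithmetic, not llama.cpp code: the function name `tile_count` and the list of head sizes are hypothetical, and padding a head size up to the next tile boundary is an assumption of this sketch rather than a statement about how the kernel behaves.

```python
def tile_count(head_size: int, tile_len: int) -> tuple[int, int]:
    """Return (tiles, padding): how many tiles of tile_len cover head_size,
    and how many unused logical units the last tile would contain.

    Hypothetical helper for illustration only; not part of llama.cpp.
    """
    if head_size <= 0 or tile_len <= 0:
        raise ValueError("head_size and tile_len must be positive")
    tiles = -(-head_size // tile_len)        # ceiling division
    padding = tiles * tile_len - head_size   # units beyond the head size
    return tiles, padding


# Illustrative head sizes (an assumption); 80 and 112 are the two the
# release notes single out, 256 is the stated upper limit on CDNA.
for head_size in (64, 80, 112, 128, 256):
    for tile_len in (16, 32):
        tiles, padding = tile_count(head_size, tile_len)
        print(f"head size {head_size:3d}, tile length {tile_len}: "
              f"{tiles:2d} tiles, {padding:2d} units of padding")
```

The arithmetic shows that 80 and 112 are multiples of 16 but not of 32, which is one reason these two head sizes need separate treatment when the tile length changes.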
The release includes prebuilt binaries for many platforms: macOS on Apple Silicon (with optional KleidiAI) and Intel; iOS; Linux on x64, arm64, and s390x (CPU and Vulkan), plus ROCm 7.2, OpenVINO, and SYCL builds; Android arm64; Windows x64 and arm64 (CPU, CUDA 12/13, Vulkan, SYCL, HIP); and openEuler x86/aarch64 with ACL Graph. With these builds, researchers and developers can try the AMD improvements without compiling from source. The update is a notable step for running LLMs on RDNA3 hardware.
- Adds RDNA3 support to the CUDA mma FA kernel, with FP16 accumulation for VKQ tiles of 32 logical units
- Tuned kernel parameters for RDNA3, RDNA4, and CDNA1 GPUs; CDNA supports head sizes up to 256
- Prebuilt binaries available for macOS, Linux, Windows, Android, and openEuler across multiple backends
Why It Matters
llama.cpp users with AMD RDNA3 GPUs get a flash attention kernel that can use the hardware's tensor cores, along with kernel parameters tuned for their cards.