llama.cpp b9158 adds RDNA3 support with faster AMD transpose

llama.cpp Releases May 15, 2026

⚡ New release adds RDNA3 tensor core support and tunes kernels for AMD GPUs.

Deep Dive

ggml-org's llama.cpp, the popular open-source LLM inference engine, released version b9158 on May 14. The update focuses on AMD GPU optimization: it adds RDNA3 support to the CUDA mma FA (flash attention) kernel. To use tensor cores with FP16 accumulation, the VKQ tiles must be 32 logical units long in the direction of the attention head. Head sizes 80 and 112, which are not multiples of 32, instead use the regular tile length of 16 with FP32 accumulation. The 32-long tiles also allow more efficient data transposition for a warp size of 32, but they scramble the accumulators along the head dimension, which a new data_layout entry handles. Kernel parameters were tuned for RDNA3, RDNA4, and CDNA1 GPUs, and CDNA now supports head sizes up to 256.
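The tile lengths mentioned above can be checked with a little arithmetic. The sketch below is not taken from the llama.cpp source: the helper names and the list of head sizes are hypothetical, chosen only to show which head sizes divide evenly into tiles of 16 or 32 logical units.

```python
from typing import Dict, Optional, Sequence, Tuple

# Illustrative head sizes only; this is an assumption, not the list
# of sizes that the llama.cpp kernel actually supports.
HEAD_SIZES = [64, 80, 96, 112, 128, 256]


def tiles_needed(head_size: int, tile_len: int) -> Optional[int]:
    """Return how many tiles of tile_len cover head_size exactly,
    or None when tile_len does not divide head_size."""
    if head_size % tile_len != 0:
        return None
    return head_size // tile_len


def divisibility_table(
    head_sizes: Sequence[int] = HEAD_SIZES,
) -> Dict[int, Tuple[Optional[int], Optional[int]]]:
    """Map each head size to its tile counts for tile lengths 16 and 32."""
    return {d: (tiles_needed(d, 16), tiles_needed(d, 32)) for d in head_sizes}


if __name__ == "__main__":
    for d, (n16, n32) in divisibility_table().items():
        print(f"head size {d:3d}: 16-long tiles = {n16}, 32-long tiles = {n32}")
```

Among the sizes in this illustrative list, only 80 and 112 fail to split into 32-long tiles, which is consistent with the release notes singling out those two head sizes.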

The release includes prebuilt binaries for many platforms: macOS on Apple Silicon (optionally with KleidiAI) and Intel, plus iOS; Linux on x64, arm64, and s390x (CPU and Vulkan), plus ROCm 7.2, OpenVINO, and SYCL builds; Android arm64; Windows x64 and arm64 (CPU, CUDA 12/13, Vulkan, SYCL, HIP); and openEuler x86/aarch64 with ACL Graph. With these packages, researchers and developers can try the AMD improvements without compiling from source. For RDNA3 owners, the update brings the mma flash attention kernel to their hardware.

Key Points

Adds RDNA3 support to CUDA mma FA kernel with FP16 accumulation for VKQ tiles of 32 logical units
Tuned kernel parameters for RDNA3, RDNA4, and CDNA1 GPUs; CDNA supports head sizes up to 256
Prebuilt binaries available for macOS, Linux, Windows, Android, and openEuler across multiple backends

Why It Matters

llama.cpp users with AMD RDNA3 GPUs get faster, optimized LLM inference with tensor core support.
