Open Source

llama.cpp's Flash Attention on RDNA3 cuts K cache VRAM by 47%

New packing technique slashes memory with near-lossless quality using the GPU's native sudot4 instruction.

Deep Dive

A new Flash Attention kernel for llama.cpp on RDNA3 GPUs (using ROCm) cuts KV cache memory by repacking the K tensor, an approach described as a storage layout change rather than lossy quantization. Traditionally, users either quantize the KV cache (losing quality) or keep it in fp16 (burning VRAM). The new 'packed16 K' approach stores K as 8-bit payloads plus fp16 scales, then uses AMD's native `sudot4` instruction, which computes a four-element INT8 dot product (four multiply-accumulates) in a single operation, so the K data is already in the layout the instruction consumes. The result: the K cache shrinks by 47% compared to Vulkan f16 K. At 128k context with an active MTP (multi-token prediction) draft model (two full contexts), this saves 1.42 GiB of total VRAM (from 23.18 GiB down to 21.76 GiB, about 6% of the session's footprint), often the difference between fitting the session or not.
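To make the layout concrete, here is a minimal Python sketch of the idea. It is not the llama.cpp kernel. It assumes a block of 32 values sharing one fp16 scale (the real packed16 K block size and rounding rules are not stated here), and `dot4_i8` is a hypothetical software stand-in for what a single `sudot4` computes: a four-element signed INT8 dot product accumulated into an integer.

```python
import struct

BLOCK = 32  # ASSUMPTION: values per fp16 scale; not stated in the article


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest fp16 value (via struct's 'e' format)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def pack_block(values):
    """Pack BLOCK floats into (fp16 scale, int32 words holding four int8 each)."""
    assert len(values) == BLOCK
    amax = max(abs(v) for v in values)
    scale = to_fp16(amax / 127.0) if amax > 0 else 0.0
    # 8-bit payloads: each value divided by the shared scale, clamped to int8 range.
    q = [0 if scale == 0 else max(-127, min(127, round(v / scale))) for v in values]
    words = []
    for i in range(0, BLOCK, 4):
        word = 0
        for lane, byte in enumerate(q[i:i + 4]):
            word |= (byte & 0xFF) << (8 * lane)  # lane 0 in the low byte
        words.append(word)
    return scale, words


def unpack_i8(word: int, lane: int) -> int:
    """Extract one signed 8-bit lane from a packed 32-bit word."""
    byte = (word >> (8 * lane)) & 0xFF
    return byte - 256 if byte >= 128 else byte


def dot4_i8(a_word: int, b_word: int, acc: int = 0) -> int:
    """Hypothetical stand-in for one sudot4: four int8 multiply-accumulates."""
    for lane in range(4):
        acc += unpack_i8(a_word, lane) * unpack_i8(b_word, lane)
    return acc


def block_dot(packed_k, packed_q) -> float:
    """Dot product of two packed blocks: integer accumulate, then apply both scales."""
    k_scale, k_words = packed_k
    q_scale, q_words = packed_q
    acc = 0
    for kw, qw in zip(k_words, q_words):
        acc = dot4_i8(kw, qw, acc)
    return acc * k_scale * q_scale
```

The point of the layout is that each 32-bit word can be handed to the dot instruction as-is, with the scales applied once per block after the integer accumulation.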

Quality remains near-lossless, which is attributed to the packing being a storage layout change rather than a conventional quantization: K is produced in fp16 and only repacked when it is written to the cache. Measured on WikiText-2 with a 27B model (ctx=512, chunks=4), packed16 K with q4_0 V yields 97.06% same-top-token agreement and a mean KLD (Kullback-Leibler divergence) of 0.00455, well under 0.01, the threshold cited for near-indistinguishable token distributions. With q8_0 V, results improve further: 97.94% same top token and a KLD of 0.00283. For professionals, this means larger contexts, speculative decoding, or multi-turn sessions on consumer RDNA3 hardware with minimal loss of output quality.
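For readers unfamiliar with the two metrics, the following is a minimal sketch of how same-top-token agreement and mean KLD are conventionally computed from per-position logits of a reference run and a test run. The function names and the list-of-lists input format are illustrative assumptions; the exact tooling behind the figures above is not stated.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; p is the reference distribution."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def compare_runs(ref_logits, test_logits):
    """Return (same-top-token fraction, mean KLD) over all token positions."""
    same = 0
    kld_sum = 0.0
    for ref, test in zip(ref_logits, test_logits):
        # Same top token: both runs pick the same argmax at this position.
        same += ref.index(max(ref)) == test.index(max(test))
        kld_sum += kl_divergence(softmax(ref), softmax(test))
    n = len(ref_logits)
    return same / n, kld_sum / n
```

A same-top-token value of 97.06% means the two runs agree on the most likely next token at roughly 97 of every 100 positions, while the mean KLD summarizes how far the full distributions drift apart.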

Key Points
  • Packed16 K shrinks the K cache by 47% compared to Vulkan f16 K on RDNA3, saving 1.42 GiB of VRAM at 128k context with MTP.
  • Quality is near-lossless: 97.06% same top token and KLD of 0.00455 with q4_0 V; 97.94% and KLD of 0.00283 with q8_0 V.
  • Stores four 8-bit K values in each int32 so AMD's sudot4 instruction can consume them directly, avoiding conventional lossy quantization of the K cache.
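One way to sanity-check the headline numbers is simple arithmetic. The sketch below assumes one fp16 scale per 32 eight-bit payloads (a block size chosen for illustration, not stated above); under that assumption the per-element cost is 8.5 bits versus 16 for fp16, which lands close to the quoted 47%. The GiB figures are taken directly from the text.

```python
def bits_per_element(block_size, payload_bits=8, scale_bits=16):
    """Average storage per K element when a block shares one scale."""
    return (block_size * payload_bits + scale_bits) / block_size


def reduction_vs_fp16(block_size):
    """Fractional K cache saving relative to plain 16-bit storage."""
    return 1 - bits_per_element(block_size) / 16


# ASSUMPTION: block size of 32; the article does not state it.
k_cache_saving = reduction_vs_fp16(32)   # fraction of K cache saved

# Session totals quoted in the article (GiB).
before_gib = 23.18
after_gib = 21.76
saved_gib = before_gib - after_gib
saved_fraction = saved_gib / before_gib  # share of the whole session's VRAM

print(f"K cache saving: {k_cache_saving:.1%}")
print(f"Session saving: {saved_gib:.2f} GiB ({saved_fraction:.1%} of total)")
```

The two percentages differ because the K cache is only one part of the session's memory; weights, the V cache, and compute buffers are unaffected by the repacking.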

Why It Matters

Slashing KV cache memory lets professionals run longer contexts and larger models on consumer RDNA3 GPUs.