b8179
AMD's MI300X GPUs now run flash attention 39% faster with new CDNA3 MFMA support in llama.cpp.
The open-source llama.cpp project, maintained by ggml-org, has released a significant performance update with commit b8179. This commit introduces CDNA3 Matrix Fused Multiply-Add (MFMA) support specifically for flash attention operations on AMD's MI300X (gfx942) GPUs. The core addition is support for the v_mfma_f32_16x16x16_f16 matrix core instruction, which lets the flash attention kernel use the MI300X's specialized hardware for mixed-precision matrix operations (FP16 inputs, FP32 accumulation). This is a targeted optimization for AMD's latest data center accelerators, expanding llama.cpp's hardware ecosystem beyond its traditional NVIDIA CUDA and Apple Silicon strengths.
The technical implementation adds a new MFMA intrinsic path and supports head sizes 64, 80, 96, 112, and 128. Benchmark results are substantial: for a Qwen2.5-1.5B model (Q4_K_M quantization), prompt-processing speedups scale with context length: +7% at 512 tokens, +13% at 1024, +23% at 2048, and a peak of +39% at 4096 tokens. Refined dispatch logic, using a threshold of effective query count (eff_nq) >= 128, routes prompt processing to the new MMA kernel while token generation stays on the existing VEC kernel, where the MMA path showed a roughly 10% regression. All 2480 flash attention tests pass. This commit represents a major step in optimizing the popular local inference engine for competitive performance on alternative AI hardware.
- Adds CDNA3 MFMA support for flash attention on AMD MI300X GPUs, enabling v_mfma_f32_16x16x16_f16 tensor core instructions.
- Delivers up to 39% faster prompt processing (pp4096) for models like Qwen2.5-1.5B with Q4_K_M quantization.
- Includes refined kernel dispatch logic (MMA for prompts, VEC for generation) and supports head sizes 64-128, passing all 2480 tests.
Why It Matters
Lowers the cost barrier for high-performance local AI inference by optimizing for AMD's competitive data center GPUs.