Llama.cpp b9387 boosts AMD GPU inference up to 76% with smarter quant routing
Per-quant kernel selection in the CUDA backend lifts K-quant throughput on AMD MI250X by up to 76% at batch size 8.
Llama.cpp’s newest release (b9387) refines CUDA kernel selection for quantized matrix multiplication on AMD MFMA hardware (e.g., MI250X). Previously, a single global threshold (MMVQ_MAX_BATCH_SIZE=8) decided between the per-row GEMV kernel (mul_mat_vec_q, or MMVQ) and the MFMA-tiled GEMM kernel (mul_mat_q, or MMQ). On AMD CDNA, the optimal crossover point varies by quantization type: K-quants have a heavier dequantization cost, which makes the GEMM path favorable at smaller batch sizes. This patch introduces per-quant thresholds for AMD MFMA: Q3_K, Q4_K, and Q5_K switch to MMQ at batch≥4; Q2_K and Q6_K at batch≥6; all other quants stay at batch≥8, as before. On MI250X at batch=8, Q5_K_S throughput rises 76% (503→884 tok/s), Q4_K_S 68% (559→940 tok/s), and Q3_K_S 40% (629→879 tok/s). Legacy and IQ quants see no regression. Non-AMD-MFMA paths (NVIDIA, RDNA, CDNA1 without MFMA) are byte-identical to master. A debug flag (GGML_CUDA_FORCE_MMVQ=1) forces the MMVQ path for A/B testing.
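The routing rule described above can be sketched in a few lines of C++. This is an illustration only, not the code from the release: the enum, the function names (mmq_min_batch, use_mmq), and the force_mmvq parameter standing in for GGML_CUDA_FORCE_MMVQ are all hypothetical, and the thresholds are treated as inclusive (batch≥N) exactly as stated in the text, which may not match the comparison used in the actual kernels.

```cpp
#include <cassert>

// Hypothetical quant-type enum for illustration; not llama.cpp's ggml_type.
enum class QuantType { Q4_0, Q8_0, Q2_K, Q3_K, Q4_K, Q5_K, Q6_K, IQ4_XS };

// Global crossover that applied to every quant type before the patch
// (the article's MMVQ_MAX_BATCH_SIZE=8).
constexpr int kDefaultMmqMinBatch = 8;

// Smallest batch size at which the tiled GEMM path (MMQ) is chosen.
// Assumption: thresholds are inclusive, as the article states them.
inline int mmq_min_batch(QuantType type, bool amd_mfma) {
    if (!amd_mfma) {
        return kDefaultMmqMinBatch;  // NVIDIA, RDNA, etc.: unchanged
    }
    switch (type) {
        case QuantType::Q3_K:
        case QuantType::Q4_K:
        case QuantType::Q5_K:
            return 4;
        case QuantType::Q2_K:
        case QuantType::Q6_K:
            return 6;
        default:
            return kDefaultMmqMinBatch;  // legacy and IQ quants
    }
}

// Returns true when MMQ should be used, false for the per-row MMVQ path.
// force_mmvq is a stand-in for the debug flag mentioned in the article.
inline bool use_mmq(QuantType type, int batch, bool amd_mfma,
                    bool force_mmvq = false) {
    assert(batch > 0);
    if (force_mmvq) {
        return false;
    }
    return batch >= mmq_min_batch(type, amd_mfma);
}
```

Under these assumptions, use_mmq(QuantType::Q4_K, 4, true) selects MMQ on an MFMA device, while the same call with amd_mfma set to false stays on MMVQ until the batch reaches 8.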
- Up to 76% throughput boost on AMD MI250X for Q5_K_S (503→884 tok/s) and 68% for Q4_K_S (559→940) at batch=8.
- Per-quant batch thresholds on AMD MFMA: Q3_K, Q4_K, and Q5_K switch to MMQ at batch≥4; Q2_K and Q6_K at batch≥6; all other quants keep the batch≥8 threshold.
- Legacy and IQ quants keep the original threshold and show no regression; non-AMD-MFMA paths (NVIDIA, RDNA) are unaffected.
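The percentages quoted above follow directly from the before and after tok/s figures. A minimal C++ sketch of that arithmetic, with speedup_percent as a hypothetical helper name introduced here for illustration:

```cpp
#include <cmath>

// Relative throughput gain in percent, rounded to the nearest integer.
// Inputs are tokens per second before and after the change.
inline long speedup_percent(double before_tok_s, double after_tok_s) {
    return std::lround((after_tok_s / before_tok_s - 1.0) * 100.0);
}
```

With the MI250X numbers at batch=8, speedup_percent(503, 884) gives 76 for Q5_K_S, speedup_percent(559, 940) gives 68 for Q4_K_S, and speedup_percent(629, 879) gives 40 for Q3_K_S.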
Why It Matters
Local inference on AMD MFMA GPUs such as the MI250X gets markedly faster for K-quant models at small batch sizes, which means more responsive applications and more throughput from the same hardware.