Llama.cpp b9387 boosts AMD GPU inference up to 76% with smarter quant routing

llama.cpp Releases May 29, 2026

⚡ New CUDA-backend optimizations for AMD MI250X deliver throughput gains of up to 76% on K-quants.

Deep Dive

Llama.cpp’s newest release (b9387) refines kernel selection for quantized matrix multiplication in the CUDA backend on AMD MFMA hardware (e.g., MI250X). Previously, a single global threshold (MMVQ_MAX_BATCH_SIZE=8) decided between the per-row GEMV kernel (mul_mat_vec_q, MMVQ) and the MFMA-tiled GEMM kernel (mul_mat_q, MMQ). On AMD CDNA, the optimal crossover point varies by quantization type: K-quants (Q3–Q6) have a heavier dequantization cost, which makes the GEMM path favorable at smaller batch sizes.

This patch introduces a per-quant threshold for AMD MFMA: Q3_K, Q4_K, and Q5_K switch to MMQ at batch≥4; Q2_K and Q6_K at batch≥6; all other quants stay at batch≥8 (unchanged from before). On MI250X at batch=8, Q5_K_S throughput rises 76% (503→884 tok/s), Q4_K_S 68% (559→940), and Q3_K_S 40% (629→879). Legacy and IQ quants see no regression. Non-AMD-MFMA paths (NVIDIA, RDNA, CDNA1 without MFMA) are byte-identical to master. A debug flag (GGML_CUDA_FORCE_MMVQ=1) allows A/B testing.
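The routing rule described above can be sketched in a few lines of Python. This is an illustrative model only, not the llama.cpp implementation: the names `Kernel`, `select_kernel`, `DEFAULT_MMQ_MIN_BATCH`, and `AMD_MFMA_MMQ_MIN_BATCH` are hypothetical, the threshold values are the ones quoted in this article, and the boundary comparison (`>=`) follows the article's "batch≥N" wording rather than the actual source code.

```python
from enum import Enum


class Kernel(Enum):
    MMVQ = "mul_mat_vec_q"  # per-row GEMV path
    MMQ = "mul_mat_q"       # MFMA-tiled GEMM path


# Assumption: threshold for quants that keep the pre-patch behavior,
# using the article's "batch>=8" wording.
DEFAULT_MMQ_MIN_BATCH = 8

# Assumption: per-quant thresholds on AMD MFMA hardware, as quoted in the article.
AMD_MFMA_MMQ_MIN_BATCH = {
    "Q3_K": 4,
    "Q4_K": 4,
    "Q5_K": 4,
    "Q2_K": 6,
    "Q6_K": 6,
}


def select_kernel(quant_type: str, batch_size: int, amd_mfma: bool) -> Kernel:
    """Pick a kernel from the quant type, batch size, and hardware class."""
    threshold = DEFAULT_MMQ_MIN_BATCH
    if amd_mfma:
        # Quants missing from the table (legacy, IQ) fall back to the default.
        threshold = AMD_MFMA_MMQ_MIN_BATCH.get(quant_type, DEFAULT_MMQ_MIN_BATCH)
    return Kernel.MMQ if batch_size >= threshold else Kernel.MMVQ


if __name__ == "__main__":
    for quant in ("Q4_K", "Q6_K", "Q4_0"):
        for batch in (2, 4, 6, 8):
            choice = select_kernel(quant, batch, amd_mfma=True)
            print(f"{quant:5s} batch={batch}: {choice.value}")
```

Keeping the per-quant values in a lookup table with a default fallback mirrors the behavior the article reports: only the listed K-quants change, and everything else (including every non-MFMA device) takes the same branch as before.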

Key Points

Up to 76% throughput boost on AMD MI250X for Q5_K_S (503→884 tok/s) and 68% for Q4_K_S (559→940) at batch=8.
Per-quant batch thresholds on AMD MFMA: Q3_K, Q4_K, and Q5_K switch to MMQ at batch≥4; Q2_K and Q6_K at batch≥6; all other quants keep the existing threshold.
Legacy and IQ quants keep the original threshold, so they see no regression; non-AMD-MFMA hardware (NVIDIA, RDNA, CDNA1 without MFMA) is unaffected.
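As a quick sanity check, the quoted percentages can be recomputed from the before/after tok/s figures reported for MI250X at batch=8. The helper name `speedup_percent` is illustrative, and the numbers are copied from this article rather than measured independently.

```python
def speedup_percent(before: float, after: float) -> float:
    """Relative throughput gain, in percent, going from `before` to `after`."""
    return (after - before) / before * 100.0


# (before, after) throughput in tok/s, as reported in the article.
reported = {
    "Q3_K_S": (629, 879),
    "Q4_K_S": (559, 940),
    "Q5_K_S": (503, 884),
}

if __name__ == "__main__":
    for quant, (before, after) in reported.items():
        gain = speedup_percent(before, after)
        print(f"{quant}: {before} -> {after} tok/s (+{gain:.1f}%)")
```

Rounded to whole percentages, the results match the 40%, 68%, and 76% figures cited above.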

Why It Matters

Local AI inference on AMD MFMA GPUs gets significantly faster at small batch sizes for common K-quant formats, enabling more responsive applications and better use of the same hardware.
