In the recent KV rotation PR it was found that the existing q8 KV quants tank performance on AIME25, but most of the loss can be recovered with rotation.
A rotation technique in a new PR mostly fixes q8 KV cache quantization's performance drop on the AIME25 benchmark.
A finding in the open-source llama.cpp project has revealed a performance pitfall, and a promising fix, for a popular quantization method. In a recent pull request (#21038), contributors found that the existing q8 (8-bit) quantization of the Key-Value (KV) cache, the memory-hungry store of attention states that grows with every generated token, was causing a significant score drop on the AIME25 benchmark, a test of mathematical reasoning built from the 2025 American Invitational Mathematics Examination. In practice, users opting for q8 quantization to save VRAM were unknowingly trading away accuracy and output quality.
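To make the failure mode concrete, here is a minimal sketch of symmetric per-block 8-bit quantization, written in Python rather than llama.cpp's C++ and only loosely in the spirit of the q8 formats (block size, rounding, and the actual kernels all differ):

```python
import numpy as np

def q8_roundtrip(x: np.ndarray, block_size: int = 32) -> np.ndarray:
    """Symmetric per-block int8 quantize/dequantize.

    Illustrative only: each block stores one float scale plus int8
    codes, which is the broad idea behind q8-style KV cache formats.
    """
    flat = x.reshape(-1, block_size)
    scale = np.abs(flat).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0.0, 1.0, scale)           # guard all-zero blocks
    codes = np.clip(np.round(flat / scale), -127, 127)   # the stored int8 values
    return (codes * scale).reshape(x.shape)              # what attention later reads

rng = np.random.default_rng(0)
k = rng.normal(size=(16, 128)).astype(np.float32)  # stand-in for K-cache rows
print(f"mean abs round-trip error: {np.abs(k - q8_roundtrip(k)).mean():.5f}")
```

For well-behaved values the round-trip error is tiny; the trouble starts when a few outlier channels inflate a block's scale and drown out everything else in it, which is the regime a rotation is meant to tame.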
The proposed solution, detailed in the GitHub discussion, applies a rotation to the KV cache values before they are quantized. Early analysis suggests this mathematical adjustment recovers most of the accuracy lost to the standard q8 path. While some commenters, such as Reddit user Betadoggo_, say they will keep using full fp16 precision for maximum accuracy, the fix is a clear win for the broader community: it improves the memory vs. accuracy trade-off for developers and researchers running models like Llama 3 or Mistral locally on consumer hardware, making capable local AI more accessible without a major quality compromise.
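This summary does not spell out the exact transform used in the PR, but rotation-before-quantization is a known trick (Hadamard rotations in the QuaRot line of work are one example): multiplying the cached vectors by an orthogonal matrix spreads outlier channels across all dimensions, shrinking per-block scales, and since the matrix is orthogonal it can be undone exactly, leaving attention scores unchanged in exact arithmetic ((QR)(KR)^T = QK^T). A self-contained sketch with a random orthogonal rotation, purely illustrative:

```python
import numpy as np

def q8_roundtrip(x, block_size=32):
    """Same illustrative per-block int8 quantizer as in the earlier sketch."""
    flat = x.reshape(-1, block_size)
    scale = np.abs(flat).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0.0, 1.0, scale)
    return (np.clip(np.round(flat / scale), -127, 127) * scale).reshape(x.shape)

d = 128
rng = np.random.default_rng(0)
k = rng.normal(size=(16, d)).astype(np.float32)
k[:, 0] *= 50.0  # inject one outlier channel, a common pattern in KV activations

# Random orthogonal rotation; fast Hadamard transforms are the usual cheap
# choice in practice, but any R with R @ R.T = I demonstrates the effect.
R, _ = np.linalg.qr(rng.normal(size=(d, d)).astype(np.float32))

plain = q8_roundtrip(k)              # quantize the outlier-heavy values directly
rotated = q8_roundtrip(k @ R) @ R.T  # quantize in rotated space, rotate back

print(f"q8 error, no rotation:   {np.abs(k - plain).mean():.4f}")
print(f"q8 error, with rotation: {np.abs(k - rotated).mean():.4f}")
```

In this toy setup the rotated path cuts the mean error several-fold, because the outlier's energy is smeared evenly across the blocks before the int8 scales are chosen.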
- The q8 KV cache quantization in llama.cpp was found to 'tank performance' on the AIME25 reasoning benchmark.
- A fix using a KV rotation technique, proposed in GitHub PR #21038, recovers most of the lost performance.
- This improves the memory-accuracy trade-off for local LLM users, though some power users will stick with fp16.
Why It Matters
This fix makes running powerful local AI models more efficient, preserving critical reasoning performance while using less memory.