Open Source

KV cache quantization: ignorance, or malice?

Running Qwen-3.6 with a q8 KV cache caused subtle mistakes and degraded reasoning in agentic coding.

Deep Dive

A Reddit user with extensive software engineering experience reports that quantizing the KV cache to q8 in their Qwen-3.6 27B FP8 model (running on vLLM across two 3090s) leads to significant performance drops. They use the model for long-horizon agentic coding harness workloads with large context windows and concurrent sub-agents. At q8, they observe 'many subtle mistakes, tool calling issues, and just plain bad reasoning.' Performance improves dramatically when they keep the KV cache at 16-bit. They express frustration that quantization is often presented as a serious solution, speculating that it may only be acceptable for low-stakes chatbot applications.
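
The post does not include the actual launch configuration, but as a rough illustration of the knob in question, here is a minimal vLLM sketch of toggling KV cache precision. The model path and prompt are placeholders, and the post's 'q8' is mapped to vLLM's fp8 KV cache option, which may not be the exact scheme the poster used.

```python
# Illustrative sketch only: the original post does not show its launch
# configuration, and the model path below is a placeholder, not the
# poster's exact checkpoint.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/placeholder-fp8-checkpoint",  # assumption: stands in for the Qwen 27B FP8 model
    tensor_parallel_size=2,                   # shard across two GPUs (e.g. two 3090s)
    kv_cache_dtype="auto",                    # keep the 16-bit KV cache the poster pins;
                                              # "fp8" would quantize the KV cache to 8-bit
)

params = SamplingParams(temperature=0.0, max_tokens=256)
out = llm.generate(["Write a unit test for a retry decorator."], params)
print(out[0].outputs[0].text)
```

The server-side equivalent would be passing --kv-cache-dtype and --tensor-parallel-size to vllm serve; the rest of the poster's agentic harness is not described in the post.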

They specifically mention 'turboquant' as another technique they've seen discussed but haven't tried, suspecting it also incurs an intelligence hit. The post calls into question the trade-off between memory savings and reasoning quality, especially for agentic tasks requiring precise tool use and long context. The user's anecdotal evidence adds to growing concerns about KV cache quantization for production-grade AI systems, where even minor errors can cascade in multi-step agentic workflows.

Key Points
  • User runs Qwen-3.6 27B FP8 on vLLM with two 3090s for agentic coding with large context windows and concurrent sub-agents.
  • Quantizing KV cache to q8 caused subtle mistakes, tool calling issues, and poor reasoning; 16-bit greatly improved performance.
  • User questions why quantization is promoted for serious tasks, suggesting it may only be suitable for low-stakes chatbots or applications where errors are tolerable.

Why It Matters

For professionals deploying LLMs in agentic or high-stakes coding tasks, KV cache quantization may silently degrade reasoning and reliability.