Open Source

PSA: If your local coding agent feels "dumb" at 30k+ context, check your KV cache quantization first.

A deep dive reveals that aggressive KV cache quantization, not the model, causes JSON hallucinations and infinite loops.

Deep Dive

A technical investigation into widespread reports of AI coding agents like Qwen3-Coder and GLM 4.7 failing at long context has identified a root cause: aggressive KV cache quantization. Developers trying to run 30B+ parameter models with 64k+ context windows on 24GB of VRAM often enable Q4 or Q8 quantization for the Key-Value cache in backends like llama.cpp or ExLlamaV3, since standard benchmarks show minimal perplexity impact. This creates a hidden failure mode: agents in frameworks like OpenClaw enter infinite correction loops or hallucinate tool-call parameters after processing roughly 30,000 tokens, a breakdown previously misdiagnosed as model context degradation.
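
For context, here is a minimal sketch of the failing setup, assuming the llama-cpp-python bindings; the model path, context size, and cache types are illustrative placeholders, not a confirmed reproduction:

```python
# Typical VRAM-saving agent config: long context plus a fully quantized KV cache.
import llama_cpp
from llama_cpp import Llama

llm = Llama(
    model_path="qwen3-coder-30b.Q4_K_M.gguf",  # placeholder path
    n_ctx=65536,                               # 64k context for agent work
    n_gpu_layers=-1,                           # offload all layers to the GPU
    flash_attn=True,                           # llama.cpp needs this to quantize the V-cache
    type_k=llama_cpp.GGML_TYPE_Q4_0,           # 4-bit Key cache: the reported culprit
    type_v=llama_cpp.GGML_TYPE_Q4_0,           # 4-bit Value cache
)
```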

The mechanical failure occurs because the Key (K) cache is far more sensitive to precision loss than the Value (V) cache. Quantizing keys to 4-bit or 8-bit degrades the attention mechanism's ability to match exact syntax from schemas defined tens of thousands of tokens earlier, producing malformed JSON outputs. In llama.cpp, cache quantization also forces heavy dequantization work onto the CPU, crippling prompt-processing speed. The recommended solution for VRAM-constrained deployments is mixed precision: keep the K-cache at FP16/FP8 and quantize only the V-cache, or simply reduce the maximum context window to preserve an unquantized cache. Either approach keeps agentic workflows that depend on rigid syntax reliable.
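
Under the same assumptions as the sketch above, the mixed-precision variant only changes the two cache types:

```python
# Mixed-precision workaround: full-precision Keys, 8-bit Values.
import llama_cpp
from llama_cpp import Llama

llm = Llama(
    model_path="qwen3-coder-30b.Q4_K_M.gguf",  # placeholder path
    n_ctx=65536,
    n_gpu_layers=-1,
    flash_attn=True,                     # still required for a quantized V-cache
    type_k=llama_cpp.GGML_TYPE_F16,      # FP16 Keys preserve exact attention matching
    type_v=llama_cpp.GGML_TYPE_Q8_0,     # Q8 Values still capture most of the VRAM savings
)
```

With the standalone llama.cpp binaries, the same split is expressed as --cache-type-k f16 --cache-type-v q8_0 (with --flash-attn enabled).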

Key Points
  • Aggressive KV cache quantization (Q4/Q8) to save VRAM, not model failure, is what causes AI agents to hallucinate JSON/tool-calls after ~30k tokens.
  • The Key-cache is far more precision-sensitive than the Value-cache; quantizing keys creates 'fuzzy' attention matching for long-context syntax (see the toy demo after this list).
  • Workarounds include using mixed precision (FP16 Keys, Q8 Values) in supported backends or reducing context size instead of quantizing the entire cache.
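
The 'fuzzy matching' effect is easy to see in a toy NumPy experiment. This is illustrative only: it simulates per-row rounding noise, not any backend's actual quantization kernels.

```python
# Toy demo: near-duplicate keys (think repeated, almost-identical JSON syntax)
# stop being separable once Key quantization noise exceeds the score gap.
import numpy as np

rng = np.random.default_rng(0)
d = 128

def fake_quant(x, bits):
    # Simulate symmetric per-row quantization: round onto the int grid, scale back.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=-1, keepdims=True) / qmax
    return np.round(x / scale) * scale

# 16 near-duplicate keys, each a ~1% perturbation of a shared base vector.
base = rng.standard_normal(d)
keys = base + 0.01 * rng.standard_normal((16, d))
keys /= np.linalg.norm(keys, axis=1, keepdims=True)

query = keys[5].copy()  # the agent should attend to occurrence #5 exactly

for bits in (32, 8, 4):
    k = keys if bits == 32 else fake_quant(keys, bits)
    scores = k @ query  # dot-product attention logits
    print(f"{bits:>2}-bit keys -> attends to key {scores.argmax()}")
```

Full-precision keys always attend back to the exact occurrence (its score is 1.0 by construction); with quantized keys, the argmax can land on a neighboring near-duplicate, which is what a schema token defined 30k tokens earlier looks like to an aggressively quantized K-cache.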

Why It Matters

Teams deploying long-context coding agents must configure quantization correctly to avoid unreliable, hallucinating AI that breaks production workflows.