Open Source

Skipping 90% of KV dequant work → +22.8% decode at 32K (llama.cpp, TurboQuant)

A three-line kernel change sidesteps the KV-cache dequantization bottleneck, boosting decode speed by nearly 23% at 32K context.

Deep Dive

A developer known as Pidtom has released TurboQuant Plus, an optimization for the popular llama.cpp inference engine. The project addresses a critical performance bottleneck: dequantization of the Key-Value (KV) cache, which stores past attention information for long-context generation. At a 32K context length on an Apple M5 Max, this dequantization step alone was consuming roughly 40% of total decode time. After testing about 14 different low-level approaches, including register look-up tables (LUTs), SIMD tricks, and fused kernels, the developer concluded the hardware was already at its limit: none of these more complex methods beat the baseline.
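
For context, here is a minimal sketch of the kind of per-block dequantization work a quantized KV cache incurs on every decode step. It is modeled loosely on llama.cpp's Q8_0-style block layout (32 quantized weights sharing one scale); the struct and function names are illustrative, not the project's actual kernels:

    #include <cstdint>
    #include <cstddef>

    // Illustrative Q8_0-style block: 32 quantized weights sharing one scale.
    // (The real llama.cpp format stores the scale as fp16; float is used here for brevity.)
    constexpr int QK = 32;

    struct block_q8 {
        float  d;        // per-block scale
        int8_t qs[QK];   // quantized values
    };

    // Dequantize n cached values into floats. During decoding, a loop like this
    // runs over the entire cached history for every new token, which is why it
    // can account for a large share of decode time at 32K context.
    static void dequantize_row_q8(const block_q8 * x, float * y, size_t n) {
        const size_t nb = n / QK;
        for (size_t i = 0; i < nb; ++i) {
            for (int j = 0; j < QK; ++j) {
                y[i * QK + j] = x[i].d * static_cast<float>(x[i].qs[j]);
            }
        }
    }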

The breakthrough came from a simpler algorithmic insight inspired by flash attention. Flash attention computes the softmax attention weights before processing the Value (V) vectors, and at long contexts most of those weights are essentially zero. So instead of trying to make dequantization faster, the new kernel simply skips V dequantization entirely for token positions with negligible attention. The change, implemented in about three lines of code, exploits the inherent sparsity of attention. Results on a Qwen3.5-35B model show a 22.8% increase in decode speed at 32K context with the TurboQuant (turbo3) method, with no change in model perplexity (PPL), indicating no loss in output quality. The technique is not specific to TurboQuant, since it relies only on attention sparsity, and it also showed promising speedups on an M2 Pro chip.
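
A minimal sketch of the skip itself, assuming a flash-attention-style inner loop in which the softmax weight for each cached token is known before its V row is touched (the threshold value, names, and layout below are assumptions for illustration, not the project's actual patch):

    #include <cstdint>
    #include <cstddef>

    constexpr int   QK       = 32;     // values per quantized block (illustrative)
    constexpr int   HEAD_DIM = 128;    // attention head dimension (illustrative)
    constexpr float SKIP_EPS = 1e-6f;  // weights below this are treated as zero (assumed threshold)

    struct block_q8 {
        float  d;        // per-block scale
        int8_t qs[QK];   // quantized values
    };

    // Accumulate out += p[t] * dequant(V[t]) over all cached tokens.
    static void attend_v_sparse(const float * p,          // softmax weights, one per cached token
                                const block_q8 * v_cache, // quantized V rows, HEAD_DIM values per token
                                float * out,              // output accumulator, HEAD_DIM floats
                                size_t n_tokens) {
        const size_t blocks_per_row = HEAD_DIM / QK;
        for (size_t t = 0; t < n_tokens; ++t) {
            // The whole optimization: if this token's attention weight is negligible,
            // skip dequantizing and accumulating its V row entirely.
            if (p[t] < SKIP_EPS) {
                continue;
            }
            const block_q8 * row = v_cache + t * blocks_per_row;
            for (size_t b = 0; b < blocks_per_row; ++b) {
                for (int j = 0; j < QK; ++j) {
                    out[b * QK + j] += p[t] * row[b].d * static_cast<float>(row[b].qs[j]);
                }
            }
        }
    }

Because only positions whose weights are already effectively zero are skipped, the accumulated output is numerically almost identical, which is consistent with the reported unchanged perplexity.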

Key Points
  • Skipped V dequantization for low-attention positions, a 3-line kernel change inspired by flash attention's workflow.
  • Achieved a 22.8% decode speed increase on Qwen3.5-35B at 32K context with no accuracy loss (perplexity unchanged).
  • Solved a hard bottleneck where dequantization alone was taking ~40% of decode time, after 14 other optimization attempts failed.

Why It Matters

This directly speeds up long-context AI inference on consumer hardware, making advanced models more practical for real-time applications.