Open Source

TurboQuant appears to work very well on Gemma 4; separately, per-layer outlier-aware K quantization is beating current public fork results on Qwen PPL.

New quantization method achieves near-zero accuracy loss at ~3.1 bits per channel, beating public forks.

Deep Dive

A developer's deep dive into TurboQuant, a KV cache quantization method implemented within the llama.cpp framework, reveals significant performance gains for Google's Gemma 4 26B model. Running on an Apple M4 Pro with 48GB RAM, the 'tq2j/q4_0' configuration achieved a 34% speedup over the standard 'q4_0/q4_0' setup at a 131K token context, with near-zero accuracy loss. The method uses techniques such as QJL and FWHT rotations to compress the model's key-value cache to roughly 3.1 bits per channel while passing 36 of 37 quality tests. These results appear to surpass current public Gemma 4 forks, suggesting the TurboQuant implementation may be better optimized for this specific architecture.
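
The write-up does not include code, but the core idea behind rotation-based KV cache quantization can be sketched in a few lines: apply an orthonormal FWHT to each key/value vector so that outlier energy is spread evenly across channels, then store each channel with a uniform low-bit code. The snippet below is a minimal, hypothetical illustration only; the function names, the 3-bit width, and the toy 128-dimensional key are assumptions, and it stands in for neither the actual tq2j kernel nor its QJL component.

```python
import numpy as np

def fwht(v):
    # Orthonormal fast Walsh-Hadamard transform; length must be a power of two.
    # Because the normalized transform is its own inverse, applying it again dequantizes.
    v = np.asarray(v, dtype=np.float64).copy()
    n = v.shape[0]
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = v[i:i + h].copy()
            b = v[i + h:i + 2 * h].copy()
            v[i:i + h] = a + b
            v[i + h:i + 2 * h] = a - b
        h *= 2
    return v / np.sqrt(n)

def quantize_channel(x, bits=3):
    # Symmetric uniform quantization of one rotated channel; returns integer codes and a scale.
    qmax = 2 ** (bits - 1) - 1
    amax = np.max(np.abs(x))
    scale = amax / qmax if amax > 0 else 1.0
    codes = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return codes, scale

# Hypothetical toy example: one 128-dim key vector from the KV cache with a single outlier.
rng = np.random.default_rng(0)
key = rng.normal(size=128)
key[7] = 12.0                                    # the rotation spreads this spike across channels

rotated = fwht(key)
codes, scale = quantize_channel(rotated, bits=3)
recon = fwht(codes.astype(np.float64) * scale)   # inverse rotation of the dequantized codes
print("max abs reconstruction error:", np.max(np.abs(recon - key)))
```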

Separately, the developer applied a more sophisticated, per-layer and outlier-aware adaptive quantization strategy to Qwen2.5 and Qwen3 models from Alibaba. This approach, which carefully allocates bits based on each layer's variance and handles statistical outliers, beat the standard 'q8_0' quantization on perplexity (PPL) benchmarks at comparable bit-per-value (bpv) rates. For instance, Qwen2.5 7B scored a PPL of 8.927 versus 8.949 for q8_0. This success highlights that advanced calibration and layer-specific strategies, not just the base quantizer algorithm, are critical for closing the performance gap with full-precision models. The findings suggest there is substantial room for further optimization across different model families.
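
A rough sketch of what a per-layer, outlier-aware scheme can look like is shown below: each layer receives a bit width tied to its activation variance under a fixed average bits-per-value budget, and the largest-magnitude values in each tensor are kept in full precision before the remainder is uniformly quantized. This is an illustrative reconstruction under stated assumptions, not the developer's actual calibration code; the greedy water-filling loop, the 1% outlier fraction, and all function names are hypothetical.

```python
import numpy as np

def allocate_bits(layer_variances, budget_bpv, min_bits=2, max_bits=8):
    # Greedy water-filling: repeatedly hand an extra bit to the layer whose
    # expected quantization error (~ variance / 4^bits) is currently largest,
    # until the average bits-per-value budget is exhausted.
    var = np.asarray(layer_variances, dtype=np.float64)
    bits = np.full(var.size, min_bits, dtype=int)
    while bits.sum() < budget_bpv * var.size:
        err = var / (4.0 ** bits)
        open_layers = np.where(bits < max_bits)[0]
        if open_layers.size == 0:
            break
        bits[open_layers[np.argmax(err[open_layers])]] += 1
    return bits

def quantize_outlier_aware(x, bits, outlier_frac=0.01):
    # Keep the largest-magnitude ~1% of values in full precision and uniformly
    # quantize the rest; returns a dequantized copy for error inspection.
    x = np.asarray(x, dtype=np.float64).ravel()
    k = max(1, int(outlier_frac * x.size))
    outliers = np.argpartition(np.abs(x), -k)[-k:]
    mask = np.zeros(x.size, dtype=bool)
    mask[outliers] = True
    qmax = 2 ** (bits - 1) - 1
    amax = np.max(np.abs(x[~mask]))
    scale = amax / qmax if amax > 0 else 1.0
    deq = x.copy()
    deq[~mask] = np.clip(np.round(x[~mask] / scale), -qmax - 1, qmax) * scale
    return deq

# Toy usage with made-up per-layer key variances and a ~4.2 bpv budget.
variances = [0.8, 2.5, 1.1, 6.0]
print(allocate_bits(variances, budget_bpv=4.2))   # higher-variance layers receive more bits
```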

Key Points
  • TurboQuant on Gemma 4 26B achieved a 34% inference speedup at 131K context while failing only 1 of 37 quality tests.
  • The Qwen2.5 7B model saw improved perplexity (8.927 vs 8.949 for q8_0) using a per-layer, outlier-aware K quantization method.
  • Results indicate superior calibration and layer-specific bit allocation are more impactful than the base quantizer alone.

Why It Matters

These advances make running large language models faster and more efficient on consumer hardware, lowering the barrier for local AI deployment.