Research & Papers

[P] TurboQuant for weights: near‑optimal 4‑bit LLM quantization with lossless 8‑bit residual – 3.2× memory savings

New weight quantization method cuts model size by up to 3.2x while maintaining near-identical performance on benchmarks.

Deep Dive

Researchers have adapted the TurboQuant algorithm, originally developed for KV-cache compression, to the problem of LLM weight quantization. The method provides a drop-in replacement for standard nn.Linear layers that achieves near-optimal distortion, meaning models retain almost all of their original capabilities despite dramatic size reductions. Initial benchmarks on the Qwen3.5-0.8B model show the '4+4 residual' configuration compressing the model from 1,504 MB to 762 MB while matching the full-precision 14.29 perplexity on WikiText-103, a roughly 2x memory saving with no measured performance loss.
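For intuition, here is a minimal sketch of the residual-encoding idea in PyTorch: quantize the weights once, then quantize the error left over from the first pass and store both. The group-wise absmax scheme and all helper names are illustrative assumptions, not the paper's exact quantizer or its nn.Linear wrapper.

    import torch

    def quantize_groupwise(w: torch.Tensor, bits: int, group_size: int = 128):
        # Symmetric absmax quantization per group of `group_size` weights.
        # A generic scheme for illustration; TurboQuant's codebooks differ.
        q_max = 2 ** (bits - 1) - 1                    # 7 for signed 4-bit
        groups = w.reshape(-1, group_size)
        scale = groups.abs().amax(dim=1, keepdim=True) / q_max
        scale = scale.clamp(min=1e-12)                 # avoid divide-by-zero
        q = (groups / scale).round().clamp(-q_max - 1, q_max)
        return q, scale

    def dequantize(q, scale, shape):
        return (q * scale).reshape(shape)

    def residual_quantize(w, bits_base=4, bits_res=4, group_size=128):
        # '4+4 residual': a 4-bit base pass plus a second 4-bit pass over
        # whatever error the base pass left behind.
        q1, s1 = quantize_groupwise(w, bits_base, group_size)
        r = w - dequantize(q1, s1, w.shape)
        q2, s2 = quantize_groupwise(r, bits_res, group_size)
        return (q1, s1), (q2, s2)

    # Reconstruction sums both stages; the residual pass shrinks the error
    # of plain 4-bit quantization dramatically.
    w = torch.randn(4096, 4096)
    (q1, s1), (q2, s2) = residual_quantize(w)
    w_rec = dequantize(q1, s1, w.shape) + dequantize(q2, s2, w.shape)
    print((w - w_rec).abs().max())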

Further testing on the larger Qwen3.5-4B model is even more promising. The 4+4 residual configuration with group size 128 adds a mere 0.03 perplexity (from 10.67 to 10.70) while cutting the model size in half. Notably, the 4+2 residual configuration shows potential for even greater compression, achieving slightly better perplexity (10.65) than the baseline with just 6 bits total. Fidelity is also quantified with the Kullback-Leibler divergence between the quantized and full-precision models' output distributions, where 4+4 residual scores a dramatically better KLD (0.0028) than standard 4-bit quantization (0.0852).
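KLD figures of this kind are typically computed by comparing the quantized model's next-token distribution against the full-precision model's on the same inputs. Below is a hedged sketch of such a measurement, assuming Hugging Face-style causal LMs that expose .logits; this is not the authors' evaluation harness.

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def mean_token_kld(model_fp, model_q, input_ids):
        # Mean KL(P_fp || P_q) across token positions: how far the quantized
        # model's next-token distribution drifts from the full-precision one.
        logp_fp = F.log_softmax(model_fp(input_ids).logits, dim=-1)
        logp_q = F.log_softmax(model_q(input_ids).logits, dim=-1)
        # kl_div expects log-probs as input; log_target=True means the
        # target is also given as log-probs.
        kld = F.kl_div(logp_q, logp_fp, log_target=True, reduction="none")
        return kld.sum(dim=-1).mean().item()  # sum over vocab, mean over positions

    # Usage (hypothetical models): mean_token_kld(fp16_model, quantized_model, ids)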

The technique represents a significant step toward making large language models more accessible and efficient. By enabling near-lossless compression through residual encoding on top of 4-bit quantization, developers can deploy capable models on consumer hardware that previously required expensive GPUs with large VRAM. This could accelerate the democratization of AI by making state-of-the-art models runnable on standard laptops and mobile devices without sacrificing performance.

Key Points
  • Roughly halves memory (1,504 MB → 762 MB, ~2x) on Qwen3.5-0.8B with zero perplexity increase
  • Maintains near-identical performance on Qwen3.5-4B (only +0.03 PPL) while halving model size (see the bit-budget sketch after this list)
  • Uses '4+4 residual' encoding, a 4-bit base quantization plus a 4-bit quantized residual, for near-lossless compression
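To see where the halving figure comes from, here is back-of-the-envelope bits-per-weight arithmetic, assuming one fp16 scale per 128-weight group for each quantization pass; the exact storage overhead is not given in the summary, so the scale format is an assumption.

    # Effective bits per weight for each configuration, vs. a 16-bit baseline.
    group_size = 128
    scale_bits = 16   # assumed fp16 scale per group, per pass

    def bits_per_weight(stage_bits):
        return sum(b + scale_bits / group_size for b in stage_bits)

    print(bits_per_weight([4]))     # plain 4-bit:  4.125 b/w  (~3.9x smaller)
    print(bits_per_weight([4, 4]))  # 4+4 residual: 8.25  b/w  (~1.9x smaller)
    print(bits_per_weight([4, 2]))  # 4+2 residual: 6.25  b/w  (~2.6x smaller)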

Why It Matters

Enables running sophisticated LLMs on consumer hardware, dramatically reducing deployment costs and expanding AI accessibility.