TurboQuant for weights: near‑optimal 4‑bit LLM quantization with lossless 8‑bit residual – 3.2× memory savings
New weight compression method cuts LLM memory usage by 3.2x while maintaining near-optimal performance.
Researchers have adapted the TurboQuant algorithm, originally designed for KV-cache compression, to the problem of model weight quantization. The method provides a drop-in replacement for standard nn.Linear layers in large language models, achieving near-optimal distortion with minimal performance loss, which makes it more practical to run capable LLMs on hardware with limited memory.
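The repository's actual API is not shown in the summary above, so the following is only a rough sketch of what the drop-in pattern looks like: an nn.Linear replacement that stores 4-bit weights (two codes per byte) plus per-row scales and dequantizes in the forward pass. The class name, packing layout, and plain round-to-nearest quantizer are assumptions for illustration, not TurboQuant's own scheme.

```python
# Hypothetical sketch of a drop-in quantized Linear. NOT TurboQuant's quantizer;
# it uses simple per-row round-to-nearest 4-bit quantization for illustration.
import torch
import torch.nn as nn

class Int4Linear(nn.Module):
    def __init__(self, linear: nn.Linear):
        super().__init__()
        assert linear.in_features % 2 == 0, "packing assumes an even in_features"
        w = linear.weight.data.float()                        # [out, in]
        scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 7.0
        q = torch.clamp(torch.round(w / scale), -8, 7) + 8    # shift to [0, 15]
        q = q.to(torch.uint8)
        # Pack two 4-bit codes per byte: even columns in the high nibble.
        packed = (q[:, 0::2] << 4) | q[:, 1::2]
        self.register_buffer("packed", packed)
        self.register_buffer("scale", scale.to(linear.weight.dtype))
        self.bias = linear.bias
        self.in_features = linear.in_features

    def forward(self, x):
        # Unpack, undo the +8 shift, rescale, then run a normal matmul.
        hi = (self.packed >> 4).to(torch.int8) - 8
        lo = (self.packed & 0x0F).to(torch.int8) - 8
        q = torch.stack((hi, lo), dim=2).reshape(-1, self.in_features)
        w = q.to(x.dtype) * self.scale.to(x.dtype)
        return nn.functional.linear(x, w, self.bias)
```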
Benchmark results on the Qwen3.5-0.8B model using WikiText-103 illustrate the trade-offs. The 4+4 residual configuration matches the baseline perplexity of 14.29 while reducing model size from 1,504 MB to 762 MB, roughly halving the memory footprint with no measured perplexity degradation. Pure 4-bit quantization compresses further (361-381 MB) at the cost of a 1.94-2.28 point perplexity increase, giving a choice between a lossless setting and a more aggressive one.
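Those sizes are consistent with simple bits-per-parameter arithmetic, as the sketch below shows. The parameter count is back-solved from the reported ~1,504 MB fp16 baseline rather than taken from the source, and scale/metadata overhead and any unquantized layers are ignored.

```python
# Back-of-the-envelope size check. The parameter count is an assumption,
# chosen so that 16 bits/param reproduces the reported ~1,504 MB baseline.
params = 788_000_000
for label, bits in [("fp16 baseline", 16), ("4+4 residual", 8), ("pure 4-bit", 4)]:
    mb = params * bits / 8 / 2**20
    print(f"{label:>13}: ~{mb:,.0f} MB")
# fp16 baseline: ~1,503 MB   4+4 residual: ~752 MB   pure 4-bit: ~376 MB
```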
The technique is implemented as a Triton kernel to keep deployment overhead low, and the open-source code on GitHub allows immediate experimentation and integration. The work arrives as model sizes continue to grow, putting increasing pressure on memory bandwidth and storage requirements across both research and production environments.
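For experimentation with an existing checkpoint, one common approach is to recursively swap every nn.Linear for a quantized stand-in and compare weight memory before and after. The sketch below does that with the hypothetical Int4Linear from the earlier example; the model id is a placeholder, and the actual repository presumably exposes its own conversion entry point.

```python
# Illustrative integration sketch: swap nn.Linear layers for the Int4Linear
# stand-in defined in the earlier example and compare weight memory.
# The model id is a placeholder, not the checkpoint used in the benchmarks.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM

def swap_linears(module: nn.Module) -> None:
    # Recursively replace every nn.Linear child with the quantized stand-in.
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            setattr(module, name, Int4Linear(child))
        else:
            swap_linears(child)

def weight_mb(model: nn.Module) -> float:
    # Count parameters and buffers (packed weights live in buffers).
    total = sum(t.numel() * t.element_size()
                for t in list(model.parameters()) + list(model.buffers()))
    return total / 2**20

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B",
                                             torch_dtype=torch.float16)
print(f"before: {weight_mb(model):,.0f} MB")
swap_linears(model)
print(f"after:  {weight_mb(model):,.0f} MB")
```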
- Delivers up to 3.2× memory savings overall; the 4+4 residual method maintains zero perplexity degradation
- Compresses Qwen3.5-0.8B from 1,504 MB to 762 MB while keeping PPL at 14.29
- Provides drop-in nn.Linear replacement with Triton kernel implementation for practical deployment
Why It Matters
Enables running larger, more capable LLMs on consumer hardware and edge devices, democratizing access to advanced AI.