TurboQuant for weights: near‑optimal 4‑bit LLM quantization with lossless 8‑bit residual – 3.2× memory savings
New weight compression method cuts LLM memory usage by 3.2x while maintaining near-optimal performance.
Researchers have adapted the TurboQuant algorithm, originally designed for KV-cache compression, to the problem of model weight quantization. The method provides a drop-in replacement for standard nn.Linear layers in large language models, achieving near-optimal distortion with minimal performance loss, which makes it more practical to run capable LLMs on hardware with limited memory.
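The repository's actual API is not shown in the summary above, so the following is only a rough sketch of what the drop-in pattern looks like: an nn.Linear replacement that stores 4-bit weights (two codes per byte) plus per-row scales and dequantizes in the forward pass. The class name, packing layout, and plain round-to-nearest quantizer are assumptions for illustration, not TurboQuant's own scheme.

```python
# Hypothetical sketch of a drop-in quantized Linear. NOT TurboQuant's quantizer;
# it uses simple per-row round-to-nearest 4-bit quantization for illustration.
import torch
import torch.nn as nn

class Int4Linear(nn.Module):
    def __init__(self, linear: nn.Linear):
        super().__init__()
        assert linear.in_features % 2 == 0, "packing assumes an even in_features"
        w = linear.weight.data.float()                        # [out, in]
        scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 7.0
        q = torch.clamp(torch.round(w / scale), -8, 7) + 8    # shift to [0, 15]
        q = q.to(torch.uint8)
        # Pack two 4-bit codes per byte: even columns in the high nibble.
        packed = (q[:, 0::2] << 4) | q[:, 1::2]
        self.register_buffer("packed", packed)
        self.register_buffer("scale", scale.to(linear.weight.dtype))
        self.bias = linear.bias
        self.in_features = linear.in_features

    def forward(self, x):
        # Unpack, undo the +8 shift, rescale, then run a normal matmul.
        hi = (self.packed >> 4).to(torch.int8) - 8
        lo = (self.packed & 0x0F).to(torch.int8) - 8
        q = torch.stack((hi, lo), dim=2).reshape(-1, self.in_features)
        w = q.to(x.dtype) * self.scale.to(x.dtype)
        return nn.functional.linear(x, w, self.bias)
```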
Benchmark results on the Qwen3.5-0.8B model using WikiText-103 illustrate the trade-offs. The 4+4 residual configuration matches the baseline perplexity of 14.29 while reducing model size from 1,504 MB to 762 MB, roughly halving the memory footprint with no measured perplexity degradation. Pure 4-bit quantization compresses further (361-381 MB) at the cost of a 1.94-2.28 point perplexity increase, giving a choice between a lossless setting and a more aggressive one.
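Those sizes are consistent with simple bits-per-parameter arithmetic, as the sketch below shows. The parameter count is back-solved from the reported ~1,504 MB fp16 baseline rather than taken from the source, and scale/metadata overhead and any unquantized layers are ignored.

```python
# Back-of-the-envelope size check. The parameter count is an assumption,
# chosen so that 16 bits/param reproduces the reported ~1,504 MB baseline.
params = 788_000_000
for label, bits in [("fp16 baseline", 16), ("4+4 residual", 8), ("pure 4-bit", 4)]:
    mb = params * bits / 8 / 2**20
    print(f"{label:>13}: ~{mb:,.0f} MB")
# fp16 baseline: ~1,503 MB   4+4 residual: ~752 MB   pure 4-bit: ~376 MB
```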
The technique is implemented as a Triton kernel to keep deployment overhead low, and the open-source code on GitHub allows immediate experimentation and integration. The work arrives as model sizes continue to grow, putting increasing pressure on memory bandwidth and storage requirements across both research and production environments.
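For experimentation with an existing checkpoint, one common approach is to recursively swap every nn.Linear for a quantized stand-in and compare weight memory before and after. The sketch below does that with the hypothetical Int4Linear from the earlier example; the model id is a placeholder, and the actual repository presumably exposes its own conversion entry point.

```python
# Illustrative integration sketch: swap nn.Linear layers for the Int4Linear
# stand-in defined in the earlier example and compare weight memory.
# The model id is a placeholder, not the checkpoint used in the benchmarks.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM

def swap_linears(module: nn.Module) -> None:
    # Recursively replace every nn.Linear child with the quantized stand-in.
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            setattr(module, name, Int4Linear(child))
        else:
            swap_linears(child)

def weight_mb(model: nn.Module) -> float:
    # Count parameters and buffers (packed weights live in buffers).
    total = sum(t.numel() * t.element_size()
                for t in list(model.parameters()) + list(model.buffers()))
    return total / 2**20

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B",
                                             torch_dtype=torch.float16)
print(f"before: {weight_mb(model):,.0f} MB")
swap_linears(model)
print(f"after:  {weight_mb(model):,.0f} MB")
```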
- Delivers up to 3.2× memory savings overall; the 4+4 residual method maintains zero perplexity degradation
- Compresses Qwen3.5-0.8B from 1,504 MB to 762 MB while keeping PPL at 14.29
- Provides drop-in nn.Linear replacement with Triton kernel implementation for practical deployment
Why It Matters
Enables running larger, more capable LLMs on consumer hardware and edge devices, democratizing access to advanced AI.