Open Source

TurboQuant isn’t just for KV: Qwen3.5-27B at near-Q4_0 quality, about 10% smaller, and finally fitting on my 16GB 5060 Ti

New 3.5-bit format shrinks models 10% below Q4_0 size, enabling 27B models on consumer 16GB cards like the RTX 5060 Ti.

Deep Dive

A developer known as u/Imaginary-Anywhere23 has open-sourced a new quantization method called TQ3_1S that makes large language models dramatically more accessible. The technique combines a Walsh-Hadamard rotation, 8-centroid quantization, and dual half-block scales to compress Qwen3.5-27B to just 12.9GB, 10% smaller than standard Q4_0 quantization, while maintaining nearly identical quality (perplexity 7.257 vs 7.243 for Q4_0, a 0.19% increase). This represents a significant advance in model compression, inspired by recent research such as TurboQuant and RaBitQ-style transformations.
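
The post doesn't spell out the exact codebook or block layout, but the three named ingredients compose in a straightforward way. Below is a minimal NumPy sketch assuming a 32-value block, a symmetric 8-entry codebook, and absmax scaling per half-block; all three are illustrative choices, not the actual TQ3_1S specification:

    import numpy as np

    BLOCK = 32                                    # assumed block size, not the published spec
    LEVELS = np.array([-7., -5., -3., -1., 1., 3., 5., 7.],
                      dtype=np.float32)           # 8 assumed symmetric centroids (3 bits/weight)

    def walsh_hadamard(x):
        """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of 2.
        It is symmetric and orthonormal, so the same function undoes the rotation."""
        y = x.astype(np.float32).copy()
        h = 1
        while h < len(y):
            for i in range(0, len(y), 2 * h):
                a = y[i:i + h].copy()
                b = y[i + h:i + 2 * h].copy()
                y[i:i + h] = a + b
                y[i + h:i + 2 * h] = a - b
            h *= 2
        return y / np.sqrt(len(y))

    def quantize_block(w):
        """Rotate one block, then quantize each half with its own scale
        (the 'dual half-block scales')."""
        r = walsh_hadamard(w)                     # rotation spreads outliers across the block
        half = BLOCK // 2
        idx = np.empty(BLOCK, dtype=np.uint8)
        scales = np.empty(2, dtype=np.float32)
        for k in range(2):
            seg = r[k * half:(k + 1) * half]
            s = max(float(np.abs(seg).max()) / 7.0, 1e-12)   # absmax scale for this half
            scales[k] = s
            # snap each value to the nearest of the 8 centroids
            idx[k * half:(k + 1) * half] = np.abs(
                seg[:, None] / s - LEVELS[None, :]).argmin(axis=1)
        return idx, scales

    def dequantize_block(idx, scales):
        """Look up centroids, rescale each half, undo the rotation."""
        half = BLOCK // 2
        r = np.concatenate([LEVELS[idx[:half]] * scales[0],
                            LEVELS[idx[half:]] * scales[1]])
        return walsh_hadamard(r)

    w = np.random.randn(BLOCK).astype(np.float32)
    q, s = quantize_block(w)
    print("max round-trip error:", float(np.abs(w - dequantize_block(q, s)).max()))

The rotation matters because a Walsh-Hadamard transform smears outlier weights across the whole block, so a coarse 8-level codebook loses less precision on any single value; the two per-half scales then track local magnitude for the cost of only one extra scale per block.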

The practical impact is immediate: for the first time, a 27B-parameter model can run entirely in the VRAM of a consumer 16GB GPU like the RTX 5060 Ti. Previously, users needed 24GB cards or had to accept lower-quality Q3-class quantization. The developer achieved this through a llama.cpp fork with CUDA runtime support, enabling local AI deployment without API fees. While still experimental, TQ3_1S reaches 130.87 tokens/second for prompt processing and 15.55 tokens/second for generation, making powerful local AI accessible to developers with mid-range hardware.
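
A quick back-of-the-envelope check of the headline numbers (treating the reported 12.9GB as decimal gigabytes and using the nominal parameter count, both assumptions):

    params = 27e9                      # nominal Qwen3.5-27B parameter count
    size_bits = 12.9e9 * 8             # reported TQ3_1S file size, in bits
    print(f"{size_bits / params:.2f} bits/weight")  # ~3.82: above the nominal 3.5 bits because
                                                    # scales and unquantized tensors add overhead
    print(f"{16.0 - 12.9:.1f} GB headroom on a 16GB card")  # left for KV cache, activations,
                                                            # and CUDA context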

The developer acknowledges this is an early result focused on Qwen3.5-27B rather than a universal solution, and plans to release a follow-up format, TQ3_4S, with faster processing. The work represents grassroots innovation in the open-source AI community, where individual developers are pushing the boundaries of what's possible on consumer hardware through clever quantization techniques.

Key Points
  • TQ3_1S compresses Qwen3.5-27B to 12.9GB (10% smaller than Q4_0) at the cost of only a 0.19% perplexity increase (see the check after this list)
  • Enables 27B-parameter models to run entirely in VRAM on 16GB consumer GPUs like the RTX 5060 Ti for the first time
  • Uses Walsh-Hadamard rotation with 8-centroid quantization and dual half-block scales in a llama.cpp fork
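
The 0.19% figure in the first bullet follows directly from the two reported perplexities:

    q4_0, tq3_1s = 7.243, 7.257                              # reported perplexities
    print(f"{(tq3_1s - q4_0) / q4_0:.2%} relative increase")  # prints 0.19%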

Why It Matters

Democratizes access to powerful 27B models by eliminating the need for expensive 24GB+ GPUs, enabling local AI development on consumer hardware.