New Qwen3.6 27B quant cuts thinking tokens by 40%+ while staying accurate
A custom INT8 recipe delivers faster inference by generating roughly 40-58% fewer reasoning tokens.
A developer has shared promising results from a custom quantization of the Qwen3.6 27B model, inspired by Minachist's INT8 AutoRound recipe. The key finding: this quant sharply reduces the number of 'thinking' tokens the model generates before producing an answer, without sacrificing correctness. In head-to-head tests on AIME-style math problems, the custom Q8 quant used roughly 40-58% fewer tokens than popular quants such as Q8_0 and UD Q8 K XL, with the same accuracy. Fewer tokens translate directly into faster inference: for example, a problem is solved in about 2.5 minutes instead of 4 minutes at similar token throughput.
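To make the latency claim concrete: at a fixed decode rate, generation time scales linearly with the number of tokens produced. The sketch below uses the token counts from the first AIME test and an assumed decode rate of 65 tokens per second. That rate is a placeholder chosen for illustration, not a figure reported by the developer, and the function name is hypothetical.

```python
def decode_minutes(num_tokens: int, tokens_per_second: float) -> float:
    """Estimate decode time in minutes, ignoring prompt processing."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second / 60.0


# Assumption: illustrative throughput, not a measured value.
ASSUMED_TOKENS_PER_SECOND = 65.0

baseline_min = decode_minutes(16_234, ASSUMED_TOKENS_PER_SECOND)  # Q8_0 token count
custom_min = decode_minutes(9_671, ASSUMED_TOKENS_PER_SECOND)     # custom Q8 token count

print(f"Q8_0: {baseline_min:.1f} min, custom Q8: {custom_min:.1f} min")
```

Under that assumed rate the estimate comes out to roughly 4.2 versus 2.5 minutes, which is in line with the timings quoted above.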
The custom quant is 36.2 GiB, slightly larger than UD Q8 K XL (34.9 GiB). The developer notes that the larger size may contribute to the better performance, but the reduction in thinking tokens is the standout finding. Because the model generates fewer reasoning tokens, it uses less KV cache space and compute per answer, which makes it better suited to real-time applications. The model also supports MTP (multi-token prediction) layers for further speed gains. Early tests show consistent correctness across multiple runs with the same seed and sampling parameters.
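As a rough illustration of the KV cache point, the sketch below estimates cache size for a transformer in which every layer keeps a standard per-token key/value cache. The layer count, KV head count, head dimension, and 2-byte cache precision are placeholder values, not the actual Qwen3.6 27B configuration, and models with hybrid or non-standard attention layers would need a different formula.

```python
def kv_cache_bytes(
    num_tokens: int,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_elem: int = 2,
) -> int:
    """KV cache size for a sequence, assuming a K and a V tensor in every layer."""
    # Factor of 2 covers the separate key and value tensors.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * num_tokens


# Placeholder architecture values (assumptions, not the real model config).
PLACEHOLDER = dict(num_layers=64, num_kv_heads=8, head_dim=128, bytes_per_elem=2)

tokens_saved = 16_234 - 9_671  # difference in the first AIME test
saved_bytes = kv_cache_bytes(tokens_saved, **PLACEHOLDER)
print(f"KV cache saved: {saved_bytes / 2**30:.2f} GiB for {tokens_saved} fewer tokens")
```

The absolute number depends entirely on the placeholder configuration; the point is only that the saving grows linearly with the number of tokens not generated.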
The developer plans to run BF16 baselines and further validate whether less thinking is universally beneficial. For now, the results suggest that quantization recipes can be optimized not just for size and perplexity, but for the reasoning efficiency of the model itself. This opens the door to more cost-effective deployments where latency and token usage are critical.
- Custom Q8 quant used 9,671 tokens vs 16,234 for Q8_0 on an AIME problem: about 40% less thinking, same accuracy.
- In a second test, custom quant used 5,666 tokens vs 13,596 for UD Q8 K XL: about 58% less thinking, faster inference.
- Model size is 36.2 GiB (slightly larger than UD Q8 K XL at 34.9 GiB) but reduces KV cache usage by generating fewer tokens.
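The reductions can be recomputed directly from the reported token counts. The short sketch below does that arithmetic; the function and variable names are illustrative only.

```python
def token_reduction_pct(baseline_tokens: int, custom_tokens: int) -> float:
    """Percentage of tokens saved relative to the baseline quant."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return 100.0 * (baseline_tokens - custom_tokens) / baseline_tokens


# (baseline tokens, custom quant tokens) as reported for each test.
runs = {
    "Q8_0": (16_234, 9_671),
    "UD Q8 K XL": (13_596, 5_666),
}

reductions = {name: token_reduction_pct(b, c) for name, (b, c) in runs.items()}
for name, pct in reductions.items():
    print(f"vs {name}: {pct:.1f}% fewer tokens")
```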
Why It Matters
Optimizing for fewer reasoning tokens can dramatically cut inference costs and latency while preserving accuracy.