Open Source

Qwen3.6-27B IQ4_XS FULL VRAM with 110k context

Reverting a single llama.cpp commit shaves 0.4GB off the model, keeping it within reach of 16GB VRAM cards.

Deep Dive

A developer named cHunter789 has identified and fixed a regression in the Qwen3.6-27B model's IQ4_XS quantization that inflated its file size from 14.7GB to 15.1GB, pushing it beyond what 16GB VRAM GPUs can hold entirely in memory. The culprit was a single llama.cpp commit (1dab5f5a44) that hardcoded attn_qkv layer quantizations to a minimum of Q5_K, increasing the model's footprint. By reverting this commit and restoring the original IQ4_XS layer quantization, cHunter789 produced a custom GGUF that returns the model to its previous 14.7GB size.
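
To make the size jump concrete, here is a rough sketch of how a per-tensor override like the one in that commit inflates a GGUF. It is not llama.cpp's actual selection code: the bits-per-weight figures are the nominal values for IQ4_XS (4.25) and Q5_K (5.5), and the parameter split between attn_qkv and the rest of the model is an illustrative guess, so only the roughly 0.4GB delta, not the absolute sizes, lines up with the real files.

```python
# Rough sketch of how a per-tensor quantization override inflates a GGUF.
# This is NOT llama.cpp's actual selection code; the bits-per-weight values
# are the nominal figures for each format, and the parameter split below is
# an illustrative guess at a ~27B model, not Qwen3.6-27B's real tensor shapes.
BPW = {"IQ4_XS": 4.25, "Q5_K": 5.5}  # nominal bits per weight

def pick_type(name: str, requested: str, force_qkv_q5k: bool) -> str:
    """Pick a per-tensor type, optionally bumping attn_qkv up to Q5_K."""
    if force_qkv_q5k and "attn_qkv" in name and BPW[requested] < BPW["Q5_K"]:
        return "Q5_K"      # behaviour introduced by the offending commit
    return requested       # reverted behaviour: honour the requested format

def size_gb(tensors: dict, requested: str, force_qkv_q5k: bool) -> float:
    """Total weight size in GB given a requested format and override policy."""
    bits = sum(n * BPW[pick_type(name, requested, force_qkv_q5k)]
               for name, n in tensors.items())
    return bits / 8 / 1e9

# Hypothetical parameter split (illustration only).
tensors = {"attn_qkv": 2_500_000_000, "everything_else": 24_500_000_000}

print(f"reverted: {size_gb(tensors, 'IQ4_XS', False):.2f} GB")  # ~14.3 GB
print(f"forced  : {size_gb(tensors, 'IQ4_XS', True):.2f} GB")   # ~14.7 GB
# Real GGUFs land higher (14.7GB vs 15.1GB) because embeddings, norms and the
# output head are stored at higher precision; only the ~0.4GB delta is the point.
```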

Benchmarks show the custom model achieves a perplexity of 7.3804 ± 0.0276 at 65k context with q8_0 cache, a gap of only 0.0039 from the standard 15.1GB version's 7.3765 ± 0.0276, well within the quoted uncertainty. The fix is critical for users with 16GB VRAM cards, as the IQ4_XS format is described as a "unicorn": the only viable quantization for running a 27B model at decent context sizes on such hardware. The custom model is available on Hugging Face as cHunter789/Qwen3.6-27B-i1-IQ4_XS-GGUF.
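
For anyone who wants to try the 14.7GB build, a minimal loading sketch with huggingface_hub and llama-cpp-python follows. The filename inside the repo is a guess, and the context length and cache settings simply mirror the post's description (full offload, large context, q8_0 KV cache) rather than any confirmed configuration.

```python
# Hedged sketch: download the custom GGUF and load it fully offloaded with a
# q8_0 KV cache via llama-cpp-python. Filename and settings are assumptions.
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

model_path = hf_hub_download(
    repo_id="cHunter789/Qwen3.6-27B-i1-IQ4_XS-GGUF",
    filename="Qwen3.6-27B-i1-IQ4_XS.gguf",  # hypothetical filename
)

llm = Llama(
    model_path=model_path,
    n_gpu_layers=-1,     # offload every layer; the point of the 14.7GB build
    n_ctx=110_000,       # large context, as in the post's title
    flash_attn=True,     # llama.cpp needs flash attention to quantize the V cache
    type_k=8, type_v=8,  # 8 = GGML_TYPE_Q8_0, i.e. the q8_0 KV cache from the benchmark
)

out = llm("Write a binary search in C.\n", max_tokens=128)
print(out["choices"][0]["text"])
```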

Key Points
  • A llama.cpp commit (1dab5f5a44) inflated Qwen3.6-27B's IQ4_XS GGUF from 14.7GB to 15.1GB by forcing attn_qkv layers to Q5_K minimum.
  • cHunter789's custom fix reverts the commit, restoring the 14.7GB model with near-identical perplexity (7.3804 vs 7.3765).
  • The IQ4_XS format is the only viable option for running 27B models on 16GB VRAM with decent context for coding tasks.

Why It Matters

Restores 27B model access for 16GB VRAM users, critical for local AI coding and inference.