Open Source

New Qwen 27B quantization fits 16GB VRAM with 105k context

The 14.1GB model runs 1.5x–1.75x faster than the previous IQ4_XS variant and eliminates blank outputs on NVIDIA GPUs.

Deep Dive

cHunter789 has released Qwen3.6-27B-i1-IQ4_KS-GGUF, a new quantization of the Qwen 27B model optimized for NVIDIA GPUs with 16GB of VRAM. It uses ikawrakow's KS and KSS quants, which are not yet available in mainline llama.cpp, so it requires the ik_llama.cpp fork (NVIDIA CUDA and CPU only; no AMD or Apple Silicon support). At 14.1GB, the file is small enough to run a 27B-parameter model on a consumer-grade 16GB GPU. With a Q4_0 Hadamard KV cache configuration, the model supports a 105k-token context window, making it suitable for long-document tasks.
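The memory arithmetic behind those figures can be sketched roughly as follows. This is an illustration under stated assumptions, not the author's calculation: it treats "GB" as 10^9 bytes, pretends that all VRAM not occupied by weights is available to the KV cache (in practice compute buffers and the CUDA context also take a share), and uses the standard Q4_0 block layout of 18 bytes per 32 values. The function names are hypothetical.

```python
# Rough VRAM budget sketch; all helper names here are hypothetical.

Q4_0_BLOCK_VALUES = 32   # values stored per Q4_0 block
Q4_0_BLOCK_BYTES = 18    # 16 bytes of 4-bit quants + one 2-byte fp16 scale


def q4_0_bytes_per_value() -> float:
    """Average storage cost of one Q4_0-quantized value (4.5 bits)."""
    return Q4_0_BLOCK_BYTES / Q4_0_BLOCK_VALUES


def kv_budget_per_token(vram_gb: float, model_gb: float, context_tokens: int) -> float:
    """Upper bound on the bytes per token left over for the KV cache.

    Assumes decimal gigabytes and ignores compute buffers and driver
    overhead, so the real budget is smaller than this figure.
    """
    headroom_bytes = (vram_gb - model_gb) * 1e9
    return headroom_bytes / context_tokens


if __name__ == "__main__":
    per_token = kv_budget_per_token(16.0, 14.1, 105_000)
    values = per_token / q4_0_bytes_per_value()
    print(f"KV budget: at most {per_token / 1e3:.1f} kB per token")
    print(f"Equivalent Q4_0 values per token: about {values:,.0f}")
```

Under these assumptions the headroom works out to roughly 18 kB per token at 105k context. An fp16 cache costs 2 bytes per value, about 3.6 times the Q4_0 cost, which suggests why a 4-bit KV cache is part of the configuration.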

In several days of testing in production workflows, the new quant ran 1.5x–1.75x faster than the previous IQ4_XS variant, produced no blank outputs, and handled search-replace operations without errors. A perplexity evaluation at 65k context on the Gutenberg dataset gave a final PPL of 7.4040 ± 0.02773. The model also passed the Qwen benchmark and a needle-in-a-haystack test across 100k tokens. Users can replicate the setup with the provided llama-server configuration, which enables flash attention and sets the temperature to 0.15 for near-deterministic output.
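For readers unfamiliar with the metric, perplexity is the exponential of the mean negative log-likelihood per token, so lower is better. The sketch below shows that calculation on made-up log-probabilities. The error bar uses a simple delta-method estimate, which is an assumption made here for illustration and not necessarily how the reported ± 0.02773 was computed.

```python
import math
from statistics import mean, stdev


def perplexity(token_logprobs: list[float]) -> tuple[float, float]:
    """Return (perplexity, approximate standard error).

    token_logprobs holds the natural-log probability the model assigned
    to each actual next token. The error term propagates the standard
    error of the mean NLL through exp() (delta method); this is an
    illustrative choice, not a documented procedure of any tool.
    """
    nll = [-lp for lp in token_logprobs]
    mean_nll = mean(nll)
    ppl = math.exp(mean_nll)
    sem = stdev(nll) / math.sqrt(len(nll)) if len(nll) > 1 else 0.0
    return ppl, ppl * sem


if __name__ == "__main__":
    # Toy probabilities, not taken from the evaluation described above.
    toy = [math.log(p) for p in (0.25, 0.10, 0.20, 0.125)]
    ppl, err = perplexity(toy)
    print(f"PPL = {ppl:.4f} ± {err:.4f}")
```

Read this way, a perplexity near 7.4 means the model is, on average, about as uncertain as if it were choosing uniformly among 7.4 candidate tokens at each step.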

Key Points
  • 14.1GB model size fits 16GB NVIDIA VRAM using custom KS quants from ikawrakow.
  • Supports a 105k context window with a Q4_0 KV cache on ik_llama.cpp (NVIDIA CUDA and CPU only).
  • 1.5x–1.75x faster than prior IQ4_XS variant, with zero blank outputs and PPL of 7.4040.

Why It Matters

The release lets a 27B-parameter model run on a consumer 16GB GPU with a large context window, putting local LLM deployment within reach of more users.