Faster Qwen 3.6 27B on a single RTX 3090
Custom llama.cpp build unlocks desktop-grade inference for a 27B model.
A Reddit user (admajic) shared a breakthrough configuration for running Qwen 3.6 27B locally on an RTX 3090 at 50 tokens/second with 100k context. The setup pairs a custom Q4_K_M GGUF quantization from Hugging Face with an experimental llama.cpp commit by contributor am17an that implements multi-token prediction (MTP) speculative decoding. With a Q4_0 K/V cache and a draft depth of 2, the model fits in ~19GB of VRAM while maintaining high throughput. The user notes that draft depth 3 was too demanding for the 3090 and that context above 90k often leads to generation loops or quality degradation.
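As a rough sketch (not the poster's exact invocation), a mainline llama.cpp launch along these lines captures the memory-saving parts of the setup. The GGUF filename and the `-ngl 99` offload value are placeholder assumptions, and the MTP/draft-depth option of the experimental branch is omitted because its exact flag name isn't given here:

```bash
# Hedged sketch using standard llama.cpp flags:
#   -ngl 99          offload all layers to the RTX 3090 (assumed)
#   -c 100000        100k-token context window
#   -ctk/-ctv q4_0   quantize the K/V cache to Q4_0, keeping the fit near ~19GB
# The GGUF filename below is hypothetical; substitute the actual Hugging Face download.
./llama-cli \
  -m Qwen-27B-Q4_K_M.gguf \
  -ngl 99 \
  -c 100000 \
  -ctk q4_0 -ctv q4_0
```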
The achievement demonstrates that large context windows (100k tokens) for 27B-parameter models are now feasible on a single consumer GPU, significantly lowering the barrier for local AI applications in RAG, long-document analysis, and code generation. The user provides exact commands and configuration parameters (8 CPU threads, temperature 0.8, top-p 0.95) for reproducibility. A macOS version via Homebrew is also available. This work highlights the rapid progress in open-source LLM inference optimizations, bringing datacenter-like capabilities to desktops.
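If the model is served with llama-server (started with the same model, context, and K/V cache flags as above, plus `-t 8` for the reported 8 CPU threads), the reported sampling parameters can be supplied per request. This is an illustrative sketch, not the poster's exact workflow; the port, prompt, and token budget are placeholders:

```bash
# Query a locally running llama-server instance (default port 8080 assumed)
# using the sampling settings reported in the post.
curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{
        "prompt": "Summarize the attached design document.",
        "n_predict": 512,
        "temperature": 0.8,
        "top_p": 0.95
      }'
```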
- 50 tokens/second on a single RTX 3090 with Qwen 3.6 27B using MTP speculative decoding.
- Requires ~19GB VRAM via Q4_0 K/V cache and Q4_K_M quantization; supports 100k context.
- Context above 90k causes generation loops and quality drops; draft depth limited to 2 for stability.
Why It Matters
Local long-context inference of 27B models is now practical on consumer GPUs, enabling private AI workflows.