Hobbyist pushes Qwen 3.6 27B to 100K context on 32GB VRAM GPU
Using Q8 quantization and speculative decoding, a user reaches 95K–105K context on a single RTX 5090.
A user on r/LocalLLaMA detailed experiments running the 27B-parameter Qwen 3.6 model at Q8 (8-bit) quantization on an RTX 5090 with 32GB of VRAM. Using llama.cpp with flash attention enabled and speculative decoding via the draft-mtp (multi-token prediction) engine, they reached a 95,000-token context length with both the key and value caches stored in Q8_0 format. A second configuration pushed context to 105,000 tokens by keeping Q8_0 for the K cache and dropping the V cache to Q5_1, trading a small amount of precision for additional memory headroom. VRAM utilization hovered near capacity, with as little as 230MB free at startup on the 95K configuration, but the setup remained stable.
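The memory trade between the two cache configurations can be checked with rough arithmetic. The sketch below is an illustration, not the user's method: it uses the storage cost of ggml's block formats (Q8_0 packs 32 values into 34 bytes, Q5_1 packs 32 values into 24 bytes) and assumes the K and V caches hold the same number of elements and that the cache is the only thing that grows with context. The function names are hypothetical.

```python
# Bits stored per cache element by ggml block formats (32 elements per block).
# q8_0: 32 int8 values + one fp16 scale          = 34 bytes -> 8.5 bits/element
# q5_1: 32 5-bit values + fp16 scale + fp16 min  = 24 bytes -> 6.0 bits/element
BITS_PER_ELEMENT = {"f16": 16.0, "q8_0": 8.5, "q5_1": 6.0}


def kv_bits_per_pair(k_type: str, v_type: str) -> float:
    """Bits needed to store one K element plus one V element."""
    return BITS_PER_ELEMENT[k_type] + BITS_PER_ELEMENT[v_type]


def context_at_same_budget(base_ctx: int, base: tuple, new: tuple) -> int:
    """Context length that fits in the same KV-cache memory after changing
    cache types. Assumes (hypothetically) that K and V have equal element
    counts and that nothing else scales with context."""
    ratio = kv_bits_per_pair(*base) / kv_bits_per_pair(*new)
    return int(base_ctx * ratio)


# Starting from the reported 95K all-Q8_0 configuration:
estimate = context_at_same_budget(95_000, ("q8_0", "q8_0"), ("q8_0", "q5_1"))
print(f"Upper-bound context with Q8_0 K / Q5_1 V: {estimate:,} tokens")
```

Under these assumptions the estimate lands a little above 111K tokens, so the reported 105K is in the expected range. The shortfall is plausible because buffers other than the KV cache can also grow with context, though the source does not break this down.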
Across nine benchmark tasks, including coding, reasoning, and translation, generation throughput ranged from 101 to 146 tok/s, with 141.6 tok/s on Python code and 146.0 tok/s on summarization, and an average speculative draft acceptance rate of 66%. The user noted that Q8 quantization produced subjectively better results than Q6 or Q5 for vibe coding, despite the conventional wisdom that Qwen handles quantization well. The work suggests that high-context local inference on consumer GPUs is increasingly feasible, especially with careful tuning of KV cache precision and speculative decoding parameters.
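To give a feel for what a 66% acceptance rate buys, the sketch below uses the standard simplified model of speculative decoding, in which each drafted token is accepted independently with the same probability. Two things here are assumptions: treating the reported 66% as that per-token probability, and the draft lengths in the loop, which are hypothetical because the source does not state how many tokens were drafted per step.

```python
def expected_tokens_per_step(accept_prob: float, draft_len: int) -> float:
    """Expected tokens emitted per verification pass of the main model.

    Simplified model: each of the `draft_len` drafted tokens is accepted
    independently with probability `accept_prob`, drafting stops at the
    first rejection, and the main model always contributes one token.
    The result is the geometric sum 1 + p + p^2 + ... + p^draft_len.
    """
    if accept_prob >= 1.0:
        return float(draft_len + 1)
    return (1.0 - accept_prob ** (draft_len + 1)) / (1.0 - accept_prob)


# Hypothetical draft lengths; 0.66 is the acceptance rate from the post.
for draft_len in (1, 2, 3, 4):
    tokens = expected_tokens_per_step(0.66, draft_len)
    print(f"draft_len={draft_len}: {tokens:.2f} tokens per pass")
```

In this model the gain flattens as the draft gets longer, since later draft tokens are only useful when every earlier one was accepted. Real throughput also depends on the cost of drafting, which this sketch ignores.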
- Achieved 95K context at Q8 with full Q8_0 KV cache; 105K context by mixing Q8_0 K cache and Q5_1 V cache.
- Speculative decoding (MTP) yielded 101–146 tok/s across tasks, with a 66% draft acceptance rate.
- System used RTX 5090 (32GB VRAM), 64GB system RAM, llama.cpp with flash attention.
Why It Matters
Long-context models can now run locally on a single consumer GPU, letting users work with entire codebases or long documents without cloud dependencies.