80 tok/sec and 128K context on 12GB VRAM with Qwen3.6 35B A3B and llama.cpp MTP
80 tokens/sec with 128K context on a 12GB GPU? Here's how.
A Reddit user demonstrated that Qwen3.6 35B A3B, a mixture-of-experts model with 35B total parameters and only 3B active per token, can generate text at over 80 tokens per second with a 128K context window on just 12GB of VRAM. Running on an RTX 4070 Super, the setup uses llama.cpp compiled from source with an experimental Multi-Token Prediction (MTP) PR. The MTP head acts as a built-in draft model, predicting several tokens ahead for the main model to verify, with 80-95% of drafted tokens accepted across coding, math, translation, and creative tasks.
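For readers who want to reproduce the build, here is a minimal sketch. It assumes a CUDA toolchain and uses `<PR_NUMBER>` as a placeholder for the experimental MTP pull request linked in the original post; the branch name `mtp-draft` is arbitrary.

```bash
# Clone llama.cpp and check out the experimental MTP pull request.
# <PR_NUMBER> is a placeholder; use the PR linked in the original post.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git fetch origin pull/<PR_NUMBER>/head:mtp-draft
git checkout mtp-draft

# Build with CUDA support (requires the CUDA toolkit to be installed).
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
```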
The configuration relies on careful tuning: the `-fitt 1536` parameter reserves 1.5GB of VRAM for the draft model and KV cache, while the rest goes to the main model. Users whose GPU also drives the display may need to adjust this value; the poster noted that running the dGPU as a secondary card (with the iGPU handling display output) frees the full 12GB for inference. The result shows that large language models with 128K context are now practical on consumer hardware, thanks to aggressive quantization (Q4_K_XL) and an optimized inference stack. The community can replicate these results using the provided Hugging Face GGUF and build guide.
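A hypothetical launch command illustrating how the pieces fit together; the GGUF filename is a placeholder, and the exact flag surface (in particular `-fitt`) comes from the experimental PR branch and may differ from mainline llama.cpp.

```bash
# Sketch of a llama-server launch on the MTP branch (filenames are placeholders).
#   -c 131072    : 128K context window
#   -ngl 99      : offload as many layers as possible to the GPU
#   -fitt 1536   : reserve ~1.5 GB of VRAM for the draft model and KV cache
./build/bin/llama-server \
  -m Qwen3.6-35B-A3B-Q4_K_XL.gguf \
  -c 131072 -ngl 99 -fitt 1536 \
  --host 127.0.0.1 --port 8080
```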
- 80+ tok/sec with 128K context on a 12GB RTX 4070 Super using Qwen3.6 35B A3B GGUF
- Uses llama.cpp with MTP (Multi-Token Prediction) draft model achieving 80-95% acceptance rates
- Key parameter `-fitt 1536` balances GPU/CPU load; requires an experimental MTP PR that is not yet merged
Why It Matters
Democratizes large-scale AI inference on consumer GPUs, enabling 128K context models without expensive hardware.