Qwen3.6 35B A3B runs at up to 51 tok/s with 190K context on an 8GB VRAM GPU
Achieve 37-51 tokens/sec on a laptop with 8GB VRAM and 32GB DDR5.
A Reddit user has shared a working setup that runs the Qwen3.6 35B A3B model (in Q5 quant) on modest consumer hardware: an RTX 4060 with 8GB VRAM and 32GB DDR5 5600MHz RAM. The configuration achieves 37-51 tokens per second while maintaining a context length of approximately 190K tokens. Two variants were tested: the base mudler/Qwen3.6-35B-A3B-APEX-GGUF and a reasoning-distilled version (hesamation/Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled-GGUF), both yielding similar throughput.
The key to reaching these speeds is a custom fork of llama.cpp with TurboQuant support, which the user credits with making the KV cache far more efficient at high context sizes. The user ran it with the following flags: --ctx-size 192640, --n-gpu-layers 430, --n-cpu-moe 35, plus --no-mmap and --mlock to avoid slowdowns. DDR5 bandwidth was flagged as a critical factor, so DDR4 users should expect lower performance. The user found Q4 quant noticeably worse than Q5 for long-context reasoning, and also notes that Linux significantly outperforms Windows for this workload.
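To make the flag set concrete, here is a minimal Python sketch that assembles the command line from the flags quoted above. The binary name `llama-server`, the `--model` flag usage, and the model path are assumptions for illustration; the fork may ship a differently named binary, and any TurboQuant-specific KV cache options are left out because the post summary does not list them.

```python
import shlex


def build_llama_command(model_path, binary="llama-server"):
    """Assemble an argument list from the flags reported in the post.

    `binary` and `model_path` are hypothetical placeholders, not values
    taken from the original setup.
    """
    return [
        binary,
        "--model", model_path,
        "--ctx-size", "192640",    # roughly 190K tokens of context
        "--n-gpu-layers", "430",   # value as reported by the user
        "--n-cpu-moe", "35",       # value as reported by the user
        "--no-mmap",               # reported as helping avoid slowdowns
        "--mlock",                 # reported as helping avoid slowdowns
    ]


# Hypothetical model path; substitute the GGUF file you actually downloaded.
cmd = build_llama_command("models/qwen3.6-35b-a3b-q5.gguf")
print(shlex.join(cmd))
```

The list form can be passed straight to `subprocess.run(cmd)` once the binary and model path match a real installation; printing it first makes it easy to check the flags before launching.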
- Running Qwen3.6 35B A3B at Q5 quant on RTX 4060 8GB VRAM + 32GB DDR5 yields 37-51 tok/s at 190K context
- TurboQuant KV cache fork of llama.cpp is essential for high-context throughput
- Q4 quant degrades long-context reasoning; Linux and DDR5 RAM bandwidth are critical for peak performance
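A rough back-of-envelope calculation helps show why RAM bandwidth matters so much here. The sketch below assumes dual-channel memory with a 64-bit bus per channel, about 3B active parameters per token (read from the "A3B" in the model name), and roughly 5.5 bits per weight for a Q5 quant. None of these figures come from the original post, and the DDR4-3200 comparison is purely illustrative. It also ignores weights held in VRAM and KV cache traffic, so treat the result as an order-of-magnitude guide rather than a prediction.

```python
def peak_bandwidth_gbs(mega_transfers_per_s, channels=2, bus_bytes=8):
    """Theoretical peak DRAM bandwidth in GB/s (assumes 64-bit channels)."""
    return mega_transfers_per_s * 1e6 * bus_bytes * channels / 1e9


def decode_ceiling_tok_s(bandwidth_gbs, active_params_billion, bits_per_weight):
    """Tokens/s ceiling if every active weight is read from RAM once per token."""
    gb_read_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gbs / gb_read_per_token


# Assumed values: 3B active parameters, ~5.5 bits/weight for a Q5 quant.
ACTIVE_PARAMS_B = 3.0
BITS_PER_WEIGHT = 5.5

ddr5_bw = peak_bandwidth_gbs(5600)  # DDR5-5600, as in the reported setup
ddr4_bw = peak_bandwidth_gbs(3200)  # DDR4-3200, illustrative comparison only

ddr5_ceiling = decode_ceiling_tok_s(ddr5_bw, ACTIVE_PARAMS_B, BITS_PER_WEIGHT)
ddr4_ceiling = decode_ceiling_tok_s(ddr4_bw, ACTIVE_PARAMS_B, BITS_PER_WEIGHT)

print(f"DDR5-5600: {ddr5_bw:.1f} GB/s, ceiling ~{ddr5_ceiling:.0f} tok/s")
print(f"DDR4-3200: {ddr4_bw:.1f} GB/s, ceiling ~{ddr4_ceiling:.0f} tok/s")
```

Under these assumptions the RAM-only ceiling comes out at roughly 43 tok/s for DDR5-5600 and roughly 25 tok/s for DDR4-3200. The reported 37-51 tok/s sits in the same range as the DDR5 figure, which is plausible given that part of the model is served from the GPU's faster memory.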
Why It Matters
The setup brings large MoE models and long-context workloads within reach of consumer laptops, which is useful for developers and researchers who lack access to high-end GPUs.