Running Qwen3.6 35B A3B at Q5 quant on RTX 4060 8GB VRAM + 32GB DDR5 yields 37-51 tok/s at 190K context?

Running Qwen3.6 35B A3B at Q5 quant on RTX 4060 8GB VRAM + 32GB DDR5 yields 37-51 tok/s at 190K context

TurboQuant KV cache fork of llama.cpp is essential for high-context throughput?

TurboQuant KV cache fork of llama.cpp is essential for high-context throughput

Q4 quant degrades long-context reasoning; Linux and DDR5 RAM bandwidth are critical for peak performance?

Q4 quant degrades long-context reasoning; Linux and DDR5 RAM bandwidth are critical for peak performance

Open Source

Qwen3.6 35B A3B runs at 51 tok/s with 190K context on 8GB VRAM GPU

r/LocalLLaMA May 11, 2026

⚡Achieve 37-51 tokens/sec on a laptop with 8GB VRAM and 32GB DDR5.

Deep Dive

A Reddit user has shared a working setup that runs the Qwen3.6 35B A3B model (in Q5 quant) on modest consumer hardware: an RTX 4060 with 8GB VRAM and 32GB DDR5 5600MHz RAM. The configuration achieves 37-51 tokens per second while maintaining a context length of approximately 190K tokens. Two variants were tested: the base mudler/Qwen3.6-35B-A3B-APEX-GGUF and a reasoning-distilled version (hesamation/Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled-GGUF), both yielding similar throughput.

The key to reaching these speeds lies in a custom fork of llama.cpp with TurboQuant support, which dramatically improves KV cache efficiency at high context sizes. The user adopted specific flags: --ctx-size 192640, --n-gpu-layers 430, --n-cpu-moe 35, along with --no-mmap and --mlock to avoid slowdowns. DDR5 bandwidth was flagged as a critical factor—DDR4 users should expect lower performance. Q4 quant was noticeably worse for long-context reasoning compared to Q5. The user also notes that Linux outperforms Windows significantly for this workload.

Key Points

Running Qwen3.6 35B A3B at Q5 quant on RTX 4060 8GB VRAM + 32GB DDR5 yields 37-51 tok/s at 190K context
TurboQuant KV cache fork of llama.cpp is essential for high-context throughput
Q4 quant degrades long-context reasoning; Linux and DDR5 RAM bandwidth are critical for peak performance

Why It Matters

Democratizes large MoE models for long-context tasks on consumer laptops, useful for developers and researchers.

Read Original Article

Qwen3.6 35B A3B runs at 51 tok/s with 190K context on 8GB VRAM GPU

Why It Matters

Related Articles

🚀 Stay Ahead in AI