Open Source

Qwen3.6-35B-A3B is very usable with 12GB of VRAM

Achieves ~47 t/s generation and 914 t/s prefill on budget hardware

Deep Dive

A user benchmarked the Qwen3.6-35B-A3B-MTP MoE model (quantized to IQ4_XS) on a 12GB RTX 3060 with 32GB DDR4 and CUDA 13.x under Windows. Using llama.cpp, they found plain decoding already fast: llama-bench measured ~914 t/s for prefill (pp512) and ~46.8 t/s for generation (tg128). By sweeping -ncmoe (the number of layers whose MoE expert weights stay on the CPU), they found -ncmoe 18 to be a safe default for generation, with -ncmoe 17 right on the VRAM edge and -ncmoe 16 causing a performance cliff. KV cache sweeps showed that quantizing the cache to q8_0 cost essentially nothing in performance, so -ctk q8_0 -ctv q8_0 is recommended.
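
As an illustrative sketch rather than the user's exact command lines, the benchmark and the tuned generation run might look like the following; the GGUF filename is a placeholder, and -ngl 99 (offload all layers, with -ncmoe pushing expert weights back to the CPU) is an assumption not stated in the post:

  # Plain-decoding benchmark at pp512 / tg128 (model filename hypothetical)
  llama-bench -m Qwen3.6-35B-A3B-MTP-IQ4_XS.gguf -ngl 99 -p 512 -n 128

  # Tuned generation run: all layers offloaded, MoE experts of 18 layers kept
  # on the CPU, KV cache quantized to q8_0
  # (some builds may additionally require -fa for a quantized V cache)
  llama-cli -m Qwen3.6-35B-A3B-MTP-IQ4_XS.gguf -ngl 99 -ncmoe 18 \
      -ctk q8_0 -ctv q8_0 -p "Write a binary search in Python."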

MTP (multi-token prediction, used here for speculative decoding) was also tested using llama.cpp's MTP branch. The best result was ~47.7 t/s at depth 2 with -ncmoe 19, only about 2% faster than well-tuned plain decoding; depth 3 was worse. For practical coding, the user recommends a plain-decoding profile with 32k context (-c 32768 -ncmoe 20 -ctk q8_0 -ctv q8_0 -t 9), which yields ~43.4 t/s generation and leaves 273 MiB of VRAM free. A faster 16k profile runs at ~44.5 t/s but is VRAM-tight (37 MiB free). The big lesson: 12GB of VRAM is a very practical size for this 35B MoE model, balancing speed, context length, and memory headroom.
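
A minimal sketch of the recommended 32k coding profile as a llama-server launch; only the flags quoted in parentheses above come from the post, while the placeholder GGUF filename, -ngl 99, and the port are assumptions:

  # 32k-context coding profile (~43.4 t/s generation, ~273 MiB VRAM free)
  llama-server -m Qwen3.6-35B-A3B-MTP-IQ4_XS.gguf -ngl 99 \
      -c 32768 -ncmoe 20 -ctk q8_0 -ctv q8_0 -t 9 --port 8080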

Key Points
  • Qwen3.6-35B-A3B-MTP (IQ4_XS) on RTX 3060 12GB: prefill 914 t/s, generation 46.8 t/s (plain)
  • -ncmoe 18 is a safe default for generation; -ncmoe 16 causes a severe performance drop
  • MTP speculative decoding offers only ~2% speedup over plain decoding; recommend plain with 32k context for coding

Why It Matters

Proves that high-quality 35B MoE models are accessible on mid-range consumer GPUs, democratizing local AI inference.