Qwen3.6-35B-A3B - even in VRAM-limited scenarios it can be better to use bigger quants than you'd expect!
A user found that a 23GB Q4_K_XL quant of the model runs faster than an 18GB IQ4_XS quant on an 8GB RTX 3070.
A Reddit user running a lightweight local LLM setup—an 8GB RTX 3070 paired with 64GB of DDR4 RAM—shared a surprising finding about quantized Mixture-of-Experts (MoE) models. Initially, they chose the smallest quantization available for Qwen3.6-35B-A3B, the IQ4_XS variant at ~18GB, expecting it to be the most performant given VRAM constraints. With llama.cpp optimizations and a 32K context window, they achieved 25-30 tokens per second. However, they encountered issues with looping during the model's thinking phase.
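For readers who want to approximate the setup described above, the sketch below uses the llama-cpp-python bindings (one way to drive llama.cpp from Python). The GGUF file name, layer split, and thread count are illustrative assumptions, not the poster's exact configuration.

```python
# Minimal sketch of a partial-offload setup for a large MoE GGUF on an 8GB GPU.
# File name, layer count, and thread count below are placeholder assumptions.
from llama_cpp import Llama

llm = Llama(
    model_path="Qwen3.6-35B-A3B-IQ4_XS.gguf",  # hypothetical local filename
    n_gpu_layers=20,   # offload only as many layers as fit in 8GB of VRAM
    n_ctx=32768,       # 32K context window, as in the original setup
    n_threads=8,       # CPU threads serve the layers left in system RAM
)

out = llm("Explain mixture-of-experts routing in one paragraph.", max_tokens=256)
print(out["choices"][0]["text"])
```

With only 8GB of VRAM, everything that doesn't fit on the GPU stays in system RAM and is processed by the CPU threads, which is why the quant's CPU-side behavior matters as much as its file size.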
To their surprise, switching to a larger quantization, the Q4_K_XL variant at ~23GB, yielded significantly better results. Despite the larger file size, which pushes more of the weights into system RAM, the model ran faster, hitting 32 tokens per second even with a 128K context window. Further experimentation pointed to Q5_K_S as the best balance of quality and speed, achieving around 30 tokens per second with a 128K context window and staying above 25 tokens per second at 50K of context. The key insight is that an MoE architecture activates only a small subset of its parameters per token, so per-token memory traffic stays modest even when most weights live in system RAM; what matters more is how efficiently those weights can be streamed and dequantized, and llama.cpp's K-quants such as Q4_K_XL are generally cheaper to dequantize on the CPU than IQ-series quants, which likely explains why the larger file ended up faster.
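Some back-of-the-envelope arithmetic helps show why total file size isn't the binding constraint: for an MoE model, each generated token only needs to stream the weights of the active experts, so the bandwidth-bound ceiling is set by the active parameter count and the quant's bits per weight. The numbers below (3B active parameters, roughly 4.5 bits per weight for a Q4-class quant, and about 50 GB/s of dual-channel DDR4 bandwidth) are assumptions for illustration, not measurements from the post.

```python
# Back-of-the-envelope throughput estimate for a mostly CPU-offloaded MoE model.
# All numeric inputs are illustrative assumptions, not measured values.

def tokens_per_second(active_params: float, bits_per_weight: float,
                      bandwidth_gb_s: float) -> float:
    """Rough upper bound: each token streams every active weight once."""
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

# ~3B active parameters (the "A3B" in the name), Q4-class quant, DDR4 bandwidth.
print(f"MoE, 3B active params: {tokens_per_second(3e9, 4.5, 50):.0f} tok/s")
# A dense 35B model at the same quant would have to stream ~10x more weights.
print(f"Dense 35B params:      {tokens_per_second(35e9, 4.5, 50):.0f} tok/s")
```

Under those assumed numbers the ceiling lands right around the 25-32 tokens per second the poster reports, which is consistent with the idea that quant format and dequantization cost, rather than the 18GB-vs-23GB difference, decide how close real throughput gets to that ceiling.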
- Qwen3.6-35B-A3B Q4_K_XL (23GB) ran faster at 32 tokens/s than IQ4_XS (18GB) at 25-30 tokens/s on an 8GB RTX 3070.
- Best speed/quality balance was Q5_K_S, achieving ~30 tokens/s with 128K context and staying above 25 tokens/s at 50K context.
- For MoE models, larger quants can outperform smaller ones due to better memory bandwidth utilization and reduced CPU-GPU bottlenecks.
Why It Matters
Local LLM users with limited VRAM shouldn't assume the smallest quant of an MoE model is the fastest; a somewhat larger quant can deliver both better quality and higher throughput, so it's worth benchmarking more than one.