Qwen3.6-35B-A3B - even in VRAM-limited scenarios it can be better to use bigger quants than you'd expect!
A user found that a 23GB Q4_K_XL quant of the model runs faster than an 18GB IQ4_XS quant on an 8GB RTX 3070.
A Reddit user running a lightweight local LLM setup—an 8GB RTX 3070 paired with 64GB of DDR4 RAM—shared a surprising finding about quantized Mixture-of-Experts (MoE) models. Initially, they chose the smallest quantization available for Qwen3.6-35B-A3B, the IQ4_XS variant at ~18GB, expecting it to be the most performant given VRAM constraints. With llama.cpp optimizations and a 32K context window, they achieved 25-30 tokens per second. However, they encountered issues with looping during the model's thinking phase.
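For readers who want to approximate the setup described above, the sketch below uses the llama-cpp-python bindings (one way to drive llama.cpp from Python). The GGUF file name, layer split, and thread count are illustrative assumptions, not the poster's exact configuration.

```python
# Minimal sketch of a partial-offload setup for a large MoE GGUF on an 8GB GPU.
# File name, layer count, and thread count below are placeholder assumptions.
from llama_cpp import Llama

llm = Llama(
    model_path="Qwen3.6-35B-A3B-IQ4_XS.gguf",  # hypothetical local filename
    n_gpu_layers=20,   # offload only as many layers as fit in 8GB of VRAM
    n_ctx=32768,       # 32K context window, as in the original setup
    n_threads=8,       # CPU threads serve the layers left in system RAM
)

out = llm("Explain mixture-of-experts routing in one paragraph.", max_tokens=256)
print(out["choices"][0]["text"])
```

With only 8GB of VRAM, everything that doesn't fit on the GPU stays in system RAM and is processed by the CPU threads, which is why the quant's CPU-side behavior matters as much as its file size.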
To their surprise, switching to a larger quantization, the Q4_K_XL variant at ~23GB, yielded significantly better results. Despite the larger file size, which pushes more of the weights into system RAM, the model ran faster, hitting 32 tokens per second even with a 128K context window. Further experimentation pointed to Q5_K_S as the best balance of quality and speed, achieving around 30 tokens per second with a 128K context window and staying above 25 tokens per second at 50K of context. The key insight is that an MoE architecture activates only a small subset of its parameters per token, so per-token memory traffic stays modest even when most weights live in system RAM; what matters more is how efficiently those weights can be streamed and dequantized, and llama.cpp's K-quants such as Q4_K_XL are generally cheaper to dequantize on the CPU than IQ-series quants, which likely explains why the larger file ended up faster.
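Some back-of-the-envelope arithmetic helps show why total file size isn't the binding constraint: for an MoE model, each generated token only needs to stream the weights of the active experts, so the bandwidth-bound ceiling is set by the active parameter count and the quant's bits per weight. The numbers below (3B active parameters, roughly 4.5 bits per weight for a Q4-class quant, and about 50 GB/s of dual-channel DDR4 bandwidth) are assumptions for illustration, not measurements from the post.

```python
# Back-of-the-envelope throughput estimate for a mostly CPU-offloaded MoE model.
# All numeric inputs are illustrative assumptions, not measured values.

def tokens_per_second(active_params: float, bits_per_weight: float,
                      bandwidth_gb_s: float) -> float:
    """Rough upper bound: each token streams every active weight once."""
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

# ~3B active parameters (the "A3B" in the name), Q4-class quant, DDR4 bandwidth.
print(f"MoE, 3B active params: {tokens_per_second(3e9, 4.5, 50):.0f} tok/s")
# A dense 35B model at the same quant would have to stream ~10x more weights.
print(f"Dense 35B params:      {tokens_per_second(35e9, 4.5, 50):.0f} tok/s")
```

Under those assumed numbers the ceiling lands right around the 25-32 tokens per second the poster reports, which is consistent with the idea that quant format and dequantization cost, rather than the 18GB-vs-23GB difference, decide how close real throughput gets to that ceiling.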
- Qwen3.6-35B-A3B Q4_K_XL (23GB) ran faster at 32 tokens/s than IQ4_XS (18GB) at 25-30 tokens/s on an 8GB RTX 3070.
- Best speed/quality balance was Q5_K_S, achieving ~30 tokens/s with 128K context and staying above 25 tokens/s at 50K context.
- For MoE models, larger quants can outperform smaller ones due to better memory bandwidth utilization and reduced CPU-GPU bottlenecks.
Why It Matters
Local LLM users with limited VRAM shouldn't assume the smallest quant of an MoE model is the fastest; a somewhat larger quant can deliver both better quality and higher throughput, so it's worth benchmarking more than one.