Follow-up: Qwen3.5-35B-A3B — 7 community-requested experiments on RTX 5080 16GB
Community-driven tests on RTX 5080 reveal 7% speed gains and expose quantization pitfalls with hard data.
An independent benchmarker has published a comprehensive follow-up analysis of the Qwen3.5-35B-A3B model, running seven experiments requested by the community after the initial round of performance tests. Using an RTX 5080 16GB GPU and llama.cpp, the tests focused on quantization quality and speed for this Mixture-of-Experts model, which activates roughly 3B parameters per token. The headline result: quantizing the Key-Value (KV) cache to q8_0 is a 'free lunch,' delivering throughput gains of 12-38% with no measurable degradation in perplexity (PPL) at 512-token context. The analysis also confirms Q4_K_M as the top-performing general-purpose quantization.
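As a rough sketch of what such a setup looks like, the fragment below shows a llama.cpp server launch with the KV cache quantized to q8_0. The model filename, context size, and layer-offload count are placeholders, and exact flag syntax varies across llama.cpp versions (notably `--flash-attn`, which llama.cpp requires for a quantized V cache); this is not the benchmarker's verbatim command.

```shell
# Hypothetical launch; model path, -c, and -ngl values are placeholders.
# --cache-type-k/--cache-type-v set KV cache precision; flash attention
# must be enabled for the quantized V cache to take effect.
./llama-server \
  -m ./Qwen3.5-35B-A3B-Q4_K_M.gguf \
  -ngl 99 \
  -c 32768 \
  --flash-attn \
  --cache-type-k q8_0 \
  --cache-type-v q8_0
```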
The technical deep dive employed KL Divergence (KLD) to validate PPL findings, revealing that the UD-Q4_K_XL quantization is 3.9x worse than Q4_K_M by mean KLD and preserves the correct top-1 token only 86.2% of the time. The researcher also tested a '--fit' flag configuration that achieved 74.7 tokens/second, a 7% speed increase over the original setup. For practitioners, the updated launch command and data provide a clear roadmap for deploying Qwen3.5-35B-A3B efficiently, though a caveat remains about potential q8_0 degradation at extremely long contexts (40-100k tokens).
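The two KLD metrics cited above can be reproduced in miniature. The sketch below computes mean KL divergence and top-1 token agreement between a reference model's per-position token distributions and a quantized model's; the distributions here are toy values, not the benchmark's data.

```python
import math

def mean_kld_and_top1(ref_dists, test_dists):
    """Mean KL(ref || test) in nats, plus the fraction of positions
    where both models pick the same top-1 token."""
    klds, agree = [], 0
    for p, q in zip(ref_dists, test_dists):
        klds.append(sum(pi * math.log(pi / qi)
                        for pi, qi in zip(p, q) if pi > 0))
        agree += p.index(max(p)) == q.index(max(q))
    return sum(klds) / len(klds), agree / len(ref_dists)

# Toy distributions standing in for full-precision vs. quantized outputs.
ref  = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]]
test = [[0.6, 0.3, 0.1], [0.3, 0.6, 0.1]]  # second position flips top-1
kld, top1 = mean_kld_and_top1(ref, test)   # top1 agreement = 0.5
```

A higher mean KLD, or a top-1 agreement well below 100% (86.2% for UD-Q4_K_XL here), means the quantized model's next-token choices diverge more often from the full-precision reference.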
- KV cache quantization to q8_0 confirmed as a 'free lunch,' offering 12-38% throughput boost with <0.4% PPL change.
- KL Divergence analysis shows UD-Q4_K_XL quantization is 3.9x worse than Q4_K_M, correctly preserving the top token only 86.2% of the time.
- Optimized configuration with '--fit' flag hits 74.7 tok/s on RTX 5080, a 7% speed increase over baseline settings.
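To make the "<0.4% PPL change" claim concrete: perplexity is the exponential of the average negative log-likelihood per token, so the percentage change between two runs follows directly from their per-token log-probabilities. The numbers below are illustrative, not the benchmark's measurements.

```python
import math

def perplexity(token_logprobs):
    """PPL = exp of the average negative log-likelihood per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Toy per-token log-probabilities for a baseline (fp16 KV) run and a
# q8_0 KV-cache run; the tiny difference yields a well-under-0.4% shift.
baseline = [-1.200, -0.8, -2.1, -0.5]
kv_q8    = [-1.205, -0.8, -2.1, -0.5]
pct_change = 100 * (perplexity(kv_q8) / perplexity(baseline) - 1)
```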
Why It Matters
Provides data-driven optimization guidelines for running large MoE models efficiently, directly impacting deployment speed and cost for developers.