Open Source

New Qwen3.5-35B-A3B Unsloth Dynamic GGUFs + Benchmarks

A new dynamic quantization method achieves state-of-the-art performance across nearly all bit levels, validated by 9TB of benchmark testing.

Deep Dive

The Unsloth team has announced a major update to its quantized versions of the Qwen3.5-35B model, introducing dynamic GGUF quantizations that achieve state-of-the-art performance across nearly all bit levels. After running over 150 KL Divergence benchmarks and generating 9TB of GGUF files, the researchers identified optimal quantization strategies for the different tensor types in the model architecture. All research artifacts are publicly available, including detailed metrics for 121 quantization configurations, giving unusual transparency into the quantization process. The team also fixed a critical tool-calling chat template bug present in quants from all uploaders, improving compatibility and tool-use reliability.
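
For context on the evaluation: a KL Divergence benchmark compares the quantized model's next-token distribution against the full-precision baseline over a fixed corpus, with lower divergence meaning less behavioral drift. A minimal sketch of the metric itself, not Unsloth's actual harness (the logits arrays and their shapes are assumptions):

```python
import numpy as np

def mean_kl_divergence(logits_fp: np.ndarray, logits_quant: np.ndarray) -> float:
    """Mean per-token KL(P_fp || P_quant) for logits of shape (num_tokens, vocab)."""
    def log_softmax(x):
        # Numerically stable log-softmax over the vocab dimension.
        x = x - x.max(axis=-1, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

    log_p = log_softmax(logits_fp)      # reference (full-precision model)
    log_q = log_softmax(logits_quant)   # candidate quantization
    p = np.exp(log_p)
    # KL(P || Q) = sum_v p_v * (log p_v - log q_v), averaged over tokens.
    return float((p * (log_p - log_q)).sum(axis=-1).mean())
```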

The technical analysis yields specific insights: quantizing the ffn_up_exps and ffn_gate_exps layers to 3-bit generally works well, while ssm_out layers should stay at higher precision because accuracy degrades sharply when they are quantized to low bit widths. The team is retiring MXFP4 quantization from most GGUF variants (Q2_K_XL, Q3_K_XL, Q4_K_XL) in favor of more effective methods, noting that Q4_K outperforms MXFP4 despite using slightly more bits per weight. They also found that imatrix (importance matrix) calibration significantly improves quantization quality, especially at lower bit widths, and that certain "I quants" (such as iq3_xxs) cause 5-10% inference slowdowns. The updated Qwen3.5-35B-A3B GGUFs incorporate these findings, with other model sizes (112B, 27B) currently being converted.
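
"Dynamic" here means the quant type is chosen per tensor rather than uniformly across the model. A minimal sketch of that selection logic, with hypothetical pattern-to-type overrides mirroring the findings above (the specific quant-type choices and the default are assumptions, not Unsloth's exact recipe):

```python
import fnmatch

# Hypothetical per-tensor overrides reflecting the reported findings:
# expert FFN up/gate projections tolerate 3-bit; ssm_out does not.
TENSOR_OVERRIDES = {
    "*.ffn_up_exps.*":   "Q3_K",  # 3-bit generally works well here
    "*.ffn_gate_exps.*": "Q3_K",
    "*.ssm_out.*":       "Q6_K",  # keep high precision; degrades badly at low bits
}

def pick_quant_type(tensor_name: str, default: str = "Q4_K") -> str:
    """Return the quant type for a tensor, applying pattern overrides first."""
    for pattern, qtype in TENSOR_OVERRIDES.items():
        if fnmatch.fnmatch(tensor_name, pattern):
            return qtype
    return default

# Example: "blk.12.ffn_gate_exps.weight" -> Q3_K, "blk.12.ssm_out.weight" -> Q6_K
```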

Key Points
  • Dynamic quantization preserves roughly 99.9% of full-precision model behavior (measured via KL Divergence) across nearly all bit levels, validated with 9TB of test GGUFs
  • Retiring MXFP4 from most GGUF variants in favor of Q4_K, which performs better despite using slightly more bits per weight (4.5 vs. 4.25; see the size sketch after this list)
  • Specific tensor optimization: ffn_up_exps and ffn_gate_exps layers work well at 3-bit, while ssm_out layers should remain at higher precision
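
Bits per weight translate directly into file size, which makes the Q4_K vs. MXFP4 trade-off easy to quantify. A back-of-the-envelope sketch (the parameter count is approximate, and real GGUFs add metadata and mix tensor types):

```python
def gguf_weight_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes) for a given bit width."""
    return num_params * bits_per_weight / 8 / 1e9

params = 35e9  # Qwen3.5-35B-A3B total parameters (approximate)
for name, bpw in [("Q4_K", 4.5), ("MXFP4", 4.25)]:
    print(f"{name}: ~{gguf_weight_size_gb(params, bpw):.1f} GB at {bpw} bpw")
# Q4_K: ~19.7 GB vs. MXFP4: ~18.6 GB -- about a 6% size increase,
# which the team judged worth Q4_K's accuracy advantage.
```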

Why It Matters

Enables more efficient deployment of large language models with minimal performance loss, crucial for resource-constrained applications.