Final Qwen3.5 Unsloth GGUF Update!
The Unsloth team releases optimized GGUF files for Qwen3.5 models, achieving a 51% reduction in maximum KL divergence.
The Unsloth team has announced a significant final update to their GGUF (GPT-Generated Unified Format) quantization files for Alibaba's Qwen3.5 series of open-source language models. This release focuses on optimizing the trade-off between model size and quality loss (KL divergence) for several large models, including the 122B-A10B and 35B-A3B Mixture-of-Experts (MoE) variants. The update is described as likely the last for these GGUF files and comes with a note of gratitude to the Qwen team for their immense contributions to open-source AI, which have often meant sleepless nights during model releases. All new GGUFs now use Unsloth's improved imatrix calibration dataset, promising small but noticeable improvements across a variety of tasks.
The technical core of the update is a refined quantization method that directly targets and reduces maximum KL divergence (KLD), a key metric for measuring information loss during compression. For the UD-Q4_K_XL quantization preset, this results in an 8% increase in file size but a dramatic 51% reduction in maximum KLD, significantly preserving model quality. Benchmarks show similar gains for other presets. The release also includes chat template fixes for better tool-calling and coding outputs, and replaces BF16 layers with F16 for faster inference on hardware that lacks native BF16 support. Users are instructed to re-download the updated files for the 35B, 27B, and 122B models, with a massive 397B-A17B variant still uploading at the time of the announcement.
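To make the metric concrete, the sketch below shows how KL divergence between a full-precision model and its quantized counterpart is typically measured: for each token position, compare the two next-token probability distributions and track the mean and maximum divergence. This is a minimal illustration of the standard formula, not Unsloth's actual evaluation code; the logit values and the tiny four-token vocabulary are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) in nats: information lost when Q approximates P."""
    return sum(pi * math.log(pi / max(qi, eps))
               for pi, qi in zip(p, q) if pi > 0)

# Hypothetical per-position logits from a full-precision reference (P)
# and a quantized model (Q) over a toy 4-token vocabulary.
reference_logits = [[2.0, 1.0, 0.1, -1.0], [0.5, 0.4, 0.3, 0.2]]
quantized_logits = [[1.8, 1.1, 0.2, -0.9], [0.5, 0.5, 0.2, 0.2]]

klds = [kl_divergence(softmax(p), softmax(q))
        for p, q in zip(reference_logits, quantized_logits)]
mean_kld = sum(klds) / len(klds)
max_kld = max(klds)
print(f"mean KLD = {mean_kld:.6f} nats, max KLD = {max_kld:.6f} nats")
```

Maximum KLD captures the worst-case position rather than the average, which is why reducing it by 51% matters: it bounds how badly the quantized model can diverge on any single prediction, not just on typical ones.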
- New quantization method reduces Maximum KL Divergence by 51% for the UD-Q4_K_XL preset, trading an 8% file size increase for vastly better quality retention.
- All GGUF files now use an updated, manually improved imatrix calibration dataset, boosting performance in chat, coding, long-context, and tool-calling use cases.
- Users must re-download Qwen3.5-35B-A3B, 27B, and 122B-A10B GGUF files to get the final updates, which include chat template fixes and F16 layers for broader hardware compatibility.
Why It Matters
Delivers higher-quality, locally runnable AI models for developers, enabling better coding assistants and agents with reduced performance loss from compression.