Qwen3.6 GGUF Benchmarks
Unsloth's quantizations top the Pareto frontier in 21 of 22 cases, and the team addresses the controversy over its frequent re-uploads.
Unsloth has published comprehensive benchmarks for its GGUF quantizations of the Qwen3.6-35B-A3B model, a key resource for the local LLM community. The analysis shows that Unsloth's quantized files sit on the Pareto frontier of model quality, measured by Kullback–Leibler divergence (KLD) against the full-precision model, versus file size in 21 of 22 comparisons, making them a strong choice for users balancing performance and storage. The benchmarks are available in Unsloth's Hugging Face repository.
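KLD here quantifies how far a quant's token probability distribution drifts from the full-precision model's, and a quant is Pareto-optimal when no other file is both smaller on disk and lower in divergence. As a rough illustration (all quant names and numbers below are hypothetical, not Unsloth's measurements), a Pareto check might look like:

```python
import numpy as np

def kld(p, q, eps=1e-12):
    """KL divergence D(p || q) between two token probability vectors."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# Hypothetical (file size in GB, mean KLD) per quant -- illustrative only.
quants = {
    "IQ2_XXS": (9.2, 0.310),
    "Q3_K_M": (14.1, 0.082),
    "Q4_K_M": (18.6, 0.031),
    "Q6_K": (26.3, 0.006),
}

def pareto_frontier(points):
    """Keep each quant unless another is both smaller AND lower in KLD."""
    frontier = []
    for name, (size, div) in points.items():
        dominated = any(
            s <= size and k <= div and (s < size or k < div)
            for other, (s, k) in points.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier

print(pareto_frontier(quants))  # here, no quant dominates another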
The post also serves as a detailed transparency report addressing common criticism of Unsloth's frequent model re-uploads. The team clarifies that roughly 95% of these updates stem from factors outside its direct control, citing specific examples: the Gemma 4 model required four re-uploads, three triggered by bug fixes in the underlying llama.cpp framework (to which Unsloth contributed fixes) and one by an official template update from Google. Similarly, the team identified and patched NaN (Not a Number) errors in 21% of its own MiniMax M2.7 quants and helped other providers, such as AesSedai, fix theirs. For the Qwen3.5 models, Unsloth shared 7TB of research data to help the community optimize the quantization of specific tensors, leading to industry-wide improvements.
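For readers who want to sanity-check their own downloads, a NaN scan can be sketched with the `gguf` Python package that ships with llama.cpp. This is a minimal sketch, not Unsloth's tooling: the reader API reflects current gguf-py and may shift between versions, and it only inspects unquantized F16/F32 tensors, since quantized blocks are raw bytes that would need dequantizing first.

```python
import sys
import numpy as np
from gguf import GGUFReader, GGMLQuantizationType as QT

# Only F16/F32 tensors (norms, embeddings, etc.) can be checked directly.
FLOAT_TYPES = {QT.F32, QT.F16}

def scan_for_nans(path: str) -> list[str]:
    """Return the names of unquantized tensors containing any NaN values."""
    bad = []
    reader = GGUFReader(path)
    for tensor in reader.tensors:
        if tensor.tensor_type in FLOAT_TYPES and np.isnan(tensor.data).any():
            bad.append(tensor.name)
    return bad

if __name__ == "__main__":
    broken = scan_for_nans(sys.argv[1])
    print("NaN tensors:", broken or "none found")
```

Run as `python scan.py model.Q4_K_M.gguf`; an empty result only rules out NaNs in the float tensors, not in the quantized blocks themselves.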
Furthermore, Unsloth highlighted a critical, confirmed bug in CUDA 13.2 that causes gibberish output in low-bit quants, advising users to roll back to CUDA 13.1 temporarily until NVIDIA ships its fix in CUDA 13.3. This proactive communication underscores Unsloth's role not just in providing files but in diagnosing complex, cross-platform issues that affect the entire local AI ecosystem.
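Until the fix lands, a quick guard can flag the affected runtime. This sketch assumes a PyTorch install and reuses the version strings from the post; it is a convenience check, not an official detection method from Unsloth or NVIDIA.

```python
import torch

AFFECTED = "13.2"  # release the post identifies as producing gibberish

cuda = torch.version.cuda  # e.g. "13.2"; None on CPU-only builds
if cuda and cuda.startswith(AFFECTED):
    print(f"CUDA {cuda} detected: low-bit quants may emit gibberish. "
          "Consider rolling back to CUDA 13.1 until the 13.3 fix ships.")
else:
    print(f"CUDA version {cuda}: not the affected release.")
```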
Key Takeaways
- Unsloth's Qwen3.6-35B GGUF quants lead the Pareto frontier for KLD versus disk space in 21 of 22 evaluated cases.
- The team clarified that 95% of their model re-uploads are due to upstream fixes (llama.cpp bugs, CUDA issues) or collaborative research, not internal mistakes.
- They identified and helped fix a critical CUDA 13.2 bug causing gibberish in low-bit quants, with a fix slated for CUDA 13.3.
Why It Matters
For professionals running local LLMs, these benchmarks pinpoint the most efficient model files, while the transparency report builds trust in a rapidly evolving, complex ecosystem.