Gemma 4 26B-A4B GGUF Benchmarks
New benchmarks reveal Unsloth's quantized models achieve the lowest KL Divergence, indicating superior accuracy preservation.
Unsloth has published a detailed benchmark analysis of its quantized GGUF builds of Google's Gemma 4 26B-A4B and of the Qwen3.6 models, providing crucial data for developers running models locally. The core metric is KL Divergence (KLD), which measures how closely a quantized model's output distribution matches that of the original full-precision (BF16) model; a lower KLD means more of the original accuracy is retained. The results are decisive: Unsloth's quantizations achieved the lowest mean KLD in 21 of the 22 tested model sizes, placing them on the Pareto frontier for the best balance of size and performance. This makes their offerings the top recommendation for users who prioritize accuracy in compressed models.
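To make the metric concrete, here is a minimal Python sketch of how per-token KLD can be computed from the logits of a BF16 reference model and a quantized variant. This is an illustration, not Unsloth's actual evaluation harness, and the logit values are made up; the benchmark's reported figure is the mean of such per-token values over many tokens.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Convert raw logits into a probability distribution (numerically stable)."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def kl_divergence(p_logits: np.ndarray, q_logits: np.ndarray, eps: float = 1e-10) -> float:
    """KL(P || Q) in nats: how far the quantized model's distribution Q
    drifts from the full-precision reference P. 0.0 means identical outputs."""
    p = softmax(p_logits)
    q = softmax(q_logits)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# Hypothetical next-token logits over the same vocabulary slice.
bf16_logits = np.array([2.1, 0.3, -1.2, 0.8])   # full-precision reference
quant_logits = np.array([2.0, 0.4, -1.1, 0.7])  # quantized variant
print(f"per-token KLD: {kl_divergence(bf16_logits, quant_logits):.6f}")
```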
Alongside the benchmarks, Unsloth announced several updates to its quantization lineup. The Q6_K quantizations have been refined to be more dynamic, yielding a slight performance bump. More significantly, a new UD-IQ4_NL_XL quantization was introduced for the Gemma 4 model: at 14.6GB it fits neatly into 16GB of VRAM, bridging the gap between the smaller 13.4GB and larger 16.4GB options. Parallel updates were made for the Qwen3.6 models. Unsloth also improved its MLX quantizations (for Apple Silicon) with better layer selection, resulting in lower perplexity and KLD scores than both its previous versions and other methods such as MSQ, as shown in the metrics table Unsloth provides.
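As a rough illustration of the sizing logic behind that 16GB claim, the sketch below picks the largest quant whose weights fit a given VRAM budget. The 13.4GB, 14.6GB, and 16.4GB figures come from the article, but the names of the neighboring quants and the 1GB headroom allowance are assumptions.

```python
# File sizes in GB; only UD-IQ4_NL_XL's name appears in the article.
QUANTS = {
    "Q4_K_M": 13.4,        # name assumed; size from the article
    "UD-IQ4_NL_XL": 14.6,  # the new quant targeted at 16GB cards
    "Q5_K_M": 16.4,        # name assumed; size from the article
}

def best_fit(vram_gb: float, headroom_gb: float = 1.0) -> str | None:
    """Pick the largest quant whose weights fit in VRAM while leaving
    headroom for the KV cache and activations (headroom is an assumption)."""
    budget = vram_gb - headroom_gb
    fitting = {name: size for name, size in QUANTS.items() if size <= budget}
    return max(fitting, key=fitting.get) if fitting else None

print(best_fit(16.0))  # -> UD-IQ4_NL_XL: 14.6GB fits, 16.4GB does not
```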
- Unsloth's Gemma 4 GGUF quantizations lead in accuracy retention, winning 21 of 22 size categories in KL Divergence benchmarks.
- New UD-IQ4_NL_XL quant for Gemma 4 fits in 16GB VRAM (14.6GB), and Q6_K & MLX quants received performance updates.
- The benchmarks trace a clear Pareto frontier, helping developers choose the optimal model-size-versus-accuracy trade-off for local deployment (see the selection sketch after this list).
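To illustrate what being "on the Pareto frontier" means here, this sketch filters a set of (size, mean KLD) measurements down to the quants that no alternative beats on both axes at once. All numbers, and every name except UD-IQ4_NL_XL, are hypothetical placeholders rather than Unsloth's published results.

```python
def pareto_frontier(points: list[tuple[str, float, float]]) -> list[str]:
    """Keep quants that are not dominated: no alternative is at least as
    small AND at least as accurate (lower mean KLD), and strictly better
    on at least one axis."""
    frontier = []
    for name, size, kld in points:
        dominated = any(
            s <= size and k <= kld and (s < size or k < kld)
            for n, s, k in points
            if n != name
        )
        if not dominated:
            frontier.append(name)
    return frontier

# Made-up (size in GB, mean KLD) pairs, purely for illustration.
measurements = [
    ("Q4_0",         13.0, 0.020),
    ("Q4_K_M",       13.4, 0.012),
    ("IQ4_XS",       15.0, 0.013),  # dominated by UD-IQ4_NL_XL below
    ("UD-IQ4_NL_XL", 14.6, 0.006),
    ("Q5_K_M",       16.4, 0.005),
]
print(pareto_frontier(measurements))  # IQ4_XS drops out
```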
Why It Matters
This gives developers definitive data to deploy the most accurate, efficient quantized LLMs locally, optimizing performance within hardware constraints.