Qwen3.5-4B GGUF quants comparison (KLD vs speed) - Lunar Lake
Independent tests show Bartowski's Q5_K_M quant offers the best balance of speed and accuracy on new Intel iGPUs.
An independent benchmark by user mradermacher has provided crucial data for developers running local AI on Intel's new Lunar Lake architecture. Testing over 50 different GGUF quantization files for Alibaba's Qwen3.5-4B model on a Core Ultra 258V laptop (with 18GB iGPU memory), the analysis measured the trade-off between inference speed (tokens/sec) and accuracy loss (Kullback–Leibler Divergence). The goal was to find the optimal quant for practical use, balancing responsiveness with model capability.
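The accuracy metric here, KLD, compares the token probability distributions produced by a quantized model against those of the full-precision reference at the same position; lower means less degradation. A minimal sketch of a per-token computation, using illustrative logits rather than any data from the benchmark:

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(P || Q) in nats: expected log-ratio under the reference P."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical next-token logits for the same context from the
# full-precision reference and a quantized variant (made-up values).
ref_logits   = [2.0, 1.0, 0.1, -1.0]
quant_logits = [1.9, 1.1, 0.0, -0.9]

p = softmax(ref_logits)
q = softmax(quant_logits)
print(f"per-token KLD: {kl_divergence(p, q):.6f}")
```

In practice this is averaged over many token positions on a test corpus, which is what produces headline figures like the 0.0064 reported for Q5_K_M.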
The results crowned Bartowski's Q5_K_M quantization as the standout winner for this hardware. It achieved 21.91 tokens per second while maintaining an exceptionally low KLD of just 0.0064, indicating minimal degradation from the original full-precision model. Other strong contenders included Unsloth's Q6_K quant for maximum accuracy (KLD 0.0036) and various Q4 variants, which exceed 28 tk/s at the cost of noticeably higher KLD.
This benchmark is significant because it moves beyond theoretical specs to real-world performance on consumer-grade hardware. It gives developers and enthusiasts a verified starting point for deploying capable small language models locally, directly comparing the output from popular quantizers like Unsloth, Bartowski, and mradermacher. The detailed KLD-per-GB metric also helps users decide based on their specific storage constraints versus accuracy needs.
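A size-budget picker built on the same idea might look like the sketch below: filter quants to those that fit on disk, then take the lowest KLD. The KLD values for Q5_K_M and Q6_K come from the benchmark; the file sizes and the Q4_K_M row are illustrative placeholders, not measured values.

```python
from dataclasses import dataclass

@dataclass
class Quant:
    name: str
    size_gb: float  # GGUF file size on disk (placeholder estimates)
    kld: float      # mean KL divergence vs. the full-precision model

# KLD figures for Q5_K_M and Q6_K are from the benchmark; sizes and the
# Q4_K_M entry are illustrative only.
QUANTS = [
    Quant("Q6_K (Unsloth)",     3.3, 0.0036),
    Quant("Q5_K_M (Bartowski)", 2.9, 0.0064),
    Quant("Q4_K_M",             2.5, 0.0120),
]

def best_under_budget(quants, max_gb):
    """Return the most accurate (lowest-KLD) quant that fits the budget."""
    fitting = [q for q in quants if q.size_gb <= max_gb]
    if not fitting:
        return None
    return min(fitting, key=lambda q: q.kld)

pick = best_under_budget(QUANTS, max_gb=3.0)
print(f"best fit under 3 GB: {pick.name} (KLD {pick.kld})")
```

Swapping the key for `q.kld / q.size_gb` would rank by the KLD-per-GB figure the analysis reports, rewarding quants that give up the least accuracy per gigabyte stored.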
- Bartowski's Q5_K_M quant for Qwen3.5-4B achieved the best balance with 21.91 tk/s speed and a very low 0.0064 KLD accuracy loss.
- The benchmark tested over 50 variants on an Intel Lunar Lake Core Ultra 258V laptop, utilizing the integrated GPU with 18GB of memory.
- Results provide a practical performance map, showing Q4_0 quants exceed 28 tk/s for speed, while Q6_K quants offer the highest accuracy under 0.004 KLD.
Why It Matters
Gives developers a verified baseline for running efficient AI models on new consumer laptops, maximizing local performance for coding assistants and other on-device tools.