Open Source

Unsloth's UD Q4_K_XL fixes Google's quantization bug

llama-quantize hardcodes a value of -7 while some groups need 8

Deep Dive

A recent Reddit post reports a significant bug in Google's quantization process, affecting models quantized with the llama-quantize tool. The post identifies two problems. First, the quantization function hardcodes a value of -7, even though some groups within the model are optimized for a value of 8. Those groups are quantized against the wrong value, which leads to suboptimal compression and accuracy loss. Second, the 32-block groups are misaligned, so their data intermingles during quantization. The proposed fix is to sort these groups and quantize each one separately.
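To make the first problem concrete, here is a minimal sketch of 4-bit blockwise quantization in which the scale of each 32-value block is the largest-magnitude value divided by a configurable divisor. It is an illustration under stated assumptions, not llama-quantize's code: the function names, the candidate divisors, and the squared-error criterion are all hypothetical choices made for this example. It shows only that a single hardcoded divisor can be exact for one block and lossy for another, which is why each block is handled on its own.

```python
BLOCK_SIZE = 32  # group size mentioned in the article


def quantize_block(values, divisor):
    """Quantize one block to signed 4-bit codes in [-8, 7].

    Assumption: scale = (value with the largest magnitude) / divisor.
    """
    extreme = max(values, key=abs)
    if extreme == 0:
        return 0.0, [0] * len(values)
    scale = extreme / divisor
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate values from a scale and 4-bit codes."""
    return [scale * c for c in codes]


def block_error(values, divisor):
    """Sum of squared reconstruction errors for one block and one divisor."""
    scale, codes = quantize_block(values, divisor)
    restored = dequantize_block(scale, codes)
    return sum((a - b) ** 2 for a, b in zip(values, restored))


def best_divisor(values, candidates=(-8, -7, 7, 8)):
    """Pick the candidate divisor with the lowest error (hypothetical criterion)."""
    return min(candidates, key=lambda d: block_error(values, d))


def quantize_tensor(weights, candidates=(-8, -7, 7, 8)):
    """Split a flat weight list into blocks and quantize each one separately."""
    if len(weights) % BLOCK_SIZE != 0:
        raise ValueError("weight count must be a multiple of BLOCK_SIZE")
    blocks = []
    for start in range(0, len(weights), BLOCK_SIZE):
        block = weights[start:start + BLOCK_SIZE]
        divisor = best_divisor(block, candidates)
        scale, codes = quantize_block(block, divisor)
        blocks.append((divisor, scale, codes))
    return blocks


if __name__ == "__main__":
    # Block A sits on a grid that uses the full [-8, 7] range.
    block_a = [0.5 * c for c in list(range(-8, 8)) * 2]
    # Block B sits on a grid that only uses [-7, 7].
    block_b = [0.25 * c for c in list(range(-7, 8)) * 2 + [0, 0]]
    for name, block in (("A", block_a), ("B", block_b)):
        for d in (-8, -7):
            print(name, d, block_error(block, d))
```

In this toy setup, block A is reconstructed exactly only with a divisor of -8 and block B only with a magnitude of 7, so forcing either divisor on both blocks introduces error in one of them.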

In response, Unsloth has released an updated version called UD Q4_K_XL, which, despite its misleading name, correctly implements pure q4_0 quantization. The bf16/f16 scaling that Unsloth references has a negligible effect but is still needed for precision. Although the difference is within the margin of error, some in the community suspect that Unsloth is keeping its process hidden to maintain a competitive edge. Until an official patch is submitted (someone might have one soon), users working with quantized models should switch to Unsloth's release to avoid degraded performance.

Key Points
  • llama-quantize hardcodes a value of -7, while some groups are optimized for 8
  • 32-block groups are misaligned, so they must be sorted and quantized separately
  • Unsloth's UD Q4_K_XL uses pure q4_0 and is recommended as a temporary fix

Why It Matters

The bug affects models quantized with llama-quantize, so anyone deploying those models in production loses accuracy until it is fixed.