Unsloth's UD Q4_K_XL fixes Google's broken quantization bug

r/LocalLLaMA June 09, 2026

⚡ llama-quantize hardcodes -7 where some groups need 8

Deep Dive

A recent post on r/LocalLLaMA reports a significant bug in Google's quantization process that affects models quantized with the popular llama-quantize tool. The post describes two core problems. First, the quantization function hardcodes a value of -7, even though some groups within the model are better served by a value of 8, which costs compression efficiency and accuracy for those groups. Second, the 32-block groups are misaligned, so data from different groups intermingles during quantization. According to the post, the fix is to sort and quantize these groups separately.
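To make the two problems concrete, here is a minimal sketch of a generic symmetric 4-bit block quantizer. It is not llama-quantize's code: the function names, the `anchor_level` parameter, and the signed range of -8 to 7 are all assumptions made for illustration, and the specific values -7 and 8 from the post are not modeled. The sketch shows that each 32-element block gets its own scale, and that the integer level the largest value is pinned to changes the reconstruction error, so a single hardcoded level will not be the best choice for every block.

```python
BLOCK_SIZE = 32       # elements per block, matching the 32-element groups in the article
QMIN, QMAX = -8, 7    # assumed signed 4-bit range (hypothetical, for illustration)


def split_blocks(values):
    """Split a flat list into aligned blocks so no block mixes data with its neighbor."""
    if len(values) % BLOCK_SIZE != 0:
        raise ValueError("length must be a multiple of BLOCK_SIZE")
    return [values[i:i + BLOCK_SIZE] for i in range(0, len(values), BLOCK_SIZE)]


def quantize_block(block, anchor_level):
    """Quantize one block so its largest-magnitude value maps to anchor_level."""
    peak = max(block, key=abs)          # keeps the sign of the largest value
    if peak == 0.0:
        return 0.0, [0] * len(block)
    scale = peak / anchor_level         # one scale per block
    levels = [min(QMAX, max(QMIN, round(x / scale))) for x in block]
    return scale, levels


def dequantize_block(scale, levels):
    """Reconstruct approximate values from the scale and the integer levels."""
    return [scale * q for q in levels]


def block_error(block, anchor_level):
    """Sum of squared reconstruction errors for one block at one anchor level."""
    scale, levels = quantize_block(block, anchor_level)
    restored = dequantize_block(scale, levels)
    return sum((a - b) ** 2 for a, b in zip(block, restored))


def best_anchor(block, candidates=(-8, -7, 7)):
    """Pick the anchor level with the lowest error for this particular block."""
    return min(candidates, key=lambda level: block_error(block, level))


if __name__ == "__main__":
    # Two blocks with different shapes; each is quantized on its own.
    row = [(-1) ** i * (i + 1) / 32.0 for i in range(32)] + [i / 64.0 for i in range(32)]
    for index, block in enumerate(split_blocks(row)):
        print(index, best_anchor(block), block_error(block, -7))
```

Choosing the anchor per block, as `best_anchor` does, is one simple way to avoid a hardcoded level; whether the eventual patch takes this approach is not stated in the post.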

In response, Unsloth has released an updated quantization called UD Q4_K_XL which, despite its misleading name, reportedly uses pure q4_0. The bf16/f16 scaling that Unsloth references has only a small effect, but it is kept for precision. Although the difference is within the margin of error, some in the community suspect that Unsloth is keeping its process private to maintain a competitive edge. No official patch has been submitted yet, though the post suggests someone may have one soon. Until then, users working with quantized models are advised to switch to Unsloth's quantization to avoid degraded performance.
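The size of the bf16/f16 effect can be illustrated by rounding a per-block scale to each 16-bit format and comparing it with the original value. The sketch below uses only the Python standard library. The helper names are hypothetical, the bf16 conversion is a plain round-to-nearest-even on the upper 16 bits of a float32, and it assumes ordinary finite inputs. It says nothing about how Unsloth actually stores its scales.

```python
import struct


def to_f16(x):
    """Round a Python float to IEEE half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bf16(x):
    """Round a Python float to bfloat16 and back (finite inputs assumed)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest, ties to even, on the 16 bits that bf16 discards.
    rounding = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + rounding) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def relative_error(original, rounded):
    """Relative difference introduced by storing the scale in 16 bits."""
    return abs(rounded - original) / abs(original)


if __name__ == "__main__":
    scale = 0.1  # an arbitrary example scale, not taken from any real model
    print("f16 ", to_f16(scale), relative_error(scale, to_f16(scale)))
    print("bf16", to_bf16(scale), relative_error(scale, to_bf16(scale)))
```

f16 keeps more mantissa bits than bf16, so it rounds a scale of this size more finely, while bf16 covers a wider exponent range. Either way the change to a single scale is small, which is consistent with the article's remark that the difference is within the margin of error.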

Key Points

llama-quantize hardcodes quantization level to -7 instead of 8 for some groups
32-block groups are misaligned, requiring separate sorting and quantization
Unsloth's UD Q4_K_XL uses pure q4_0 and is recommended as a temporary fix

Why It Matters

A quantizer that picks the wrong level or mixes data across groups degrades every model it produces, so fixing this bug is critical for deploying efficient, accurate AI models in production environments.
