BDQ achieves theory-optimal LLM quantization with <1% accuracy loss
New flatness-based method cuts weight memory roughly 4x and narrows the performance gap by 39.1% over prior methods in an extreme low-bit setting.
Post-training quantization is critical for deploying large language models on resource-constrained hardware, but activation outliers have long plagued low-bit precision. A new paper by Xiusheng Huang, Zhe Li, and colleagues at multiple institutions mathematically models the relationship between quantization error and outliers, then introduces a metric called Flatness to quantify how outliers are distributed. From this framework the authors derive what they describe as a theoretically optimal solution for minimizing quantization loss.
Building on this theory, the team presents Bidirectional Diagonal Quantization (BDQ), a framework that uses learned diagonal matrices to disperse outlier magnitudes across weight and activation dimensions. The reported results are strong: on LLaMA-3-8B at aggressive W4A4 quantization (4-bit weights and activations), accuracy drops by less than 1%. In the even more challenging W2A4KV16 setting (2-bit weights, 4-bit activations, 16-bit KV cache) on DeepSeek-R1-Distill-LLaMA-70B, BDQ narrows the performance gap by 39.1% compared to prior state-of-the-art approaches. The work provides both a theoretical foundation and a practical tool for extreme LLM compression.
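To make the idea of dispersing outliers with a diagonal matrix concrete, here is a minimal sketch in plain Python. It is not the BDQ algorithm: the diagonal `d` is set by a hand-picked heuristic (the square root of each channel's activation-to-weight peak ratio) rather than learned, and the helper names, matrix shapes, and the injected outlier channel are all assumptions made for illustration. It shows only the underlying identity, that (X D^-1)(D W) equals X W, and how the rescaling moves magnitude from an outlier activation channel into the weights.

```python
import random

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def peak(m):
    """Largest absolute entry of a matrix."""
    return max(abs(v) for row in m for v in row)

def rescale(x, w, d):
    """Return (X D^-1, D W) for a diagonal D given as the list d."""
    x_s = [[v / d[j] for j, v in enumerate(row)] for row in x]
    w_s = [[v * d[i] for v in row] for i, row in enumerate(w)]
    return x_s, w_s

random.seed(0)
tokens, d_in, d_out = 32, 16, 8
x = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(tokens)]
for row in x:
    row[3] *= 50.0  # hypothetical outlier channel, injected for the demo
w = [[random.gauss(0, 0.1) for _ in range(d_out)] for _ in range(d_in)]

# Heuristic diagonal (an assumption, not BDQ's learned one):
# balance the per-channel activation and weight peaks.
act_peak = [max(abs(row[j]) for row in x) for j in range(d_in)]
w_peak = [max(abs(v) for v in w[j]) for j in range(d_in)]
d = [(act_peak[j] / w_peak[j]) ** 0.5 for j in range(d_in)]

x_s, w_s = rescale(x, w, d)
ref, out = matmul(x, w), matmul(x_s, w_s)
max_diff = max(abs(p - q) for rp, rq in zip(ref, out) for p, q in zip(rp, rq))

print("max |XW - (XD^-1)(DW)|:", max_diff)
print("activation peak before/after:", peak(x), peak(x_s))
print("weight peak before/after:", peak(w), peak(w_s))
```

In full precision the product is unchanged up to floating-point rounding, while the outlier's magnitude is now shared between the two operands. Whether this lowers low-bit quantization error depends on the quantization scheme and on how the diagonal is chosen, which is the part BDQ learns.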
- BDQ introduces Flatness, a new metric that quantifies outlier distribution to guide theory-optimal quantization.
- On LLaMA-3-8B, BDQ achieves less than 1% accuracy loss at W4A4 (4-bit weights and activations).
- On DeepSeek-R1-Distill-LLaMA-70B at W2A4KV16, BDQ reduces the performance gap by 39.1% over existing methods.
Why It Matters
Storing weights at 4 bits takes about a quarter of the memory of 16-bit weights, so methods like BDQ could bring 70B-class models within reach of far more modest hardware with minimal accuracy trade-off.
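The memory figure follows from simple arithmetic, sketched below. The function name is hypothetical, the baseline is assumed to be 16-bit weights, and the estimate covers weight storage only: it ignores quantization scales and zero points, activations, and the KV cache, so real footprints will be somewhat larger.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

params_70b = 70e9  # nominal parameter count for a 70B model

fp16 = weight_memory_gb(params_70b, 16)  # 16-bit baseline
w4 = weight_memory_gb(params_70b, 4)     # 4-bit weights
w2 = weight_memory_gb(params_70b, 2)     # 2-bit weights, as in W2A4KV16

print(f"16-bit: {fp16:.1f} GB, 4-bit: {w4:.1f} GB, 2-bit: {w2:.1f} GB")
print(f"reduction at 4 bits: {fp16 / w4:.0f}x")
```

Under these assumptions a 70B model's weights shrink from about 140 GB at 16 bits to about 35 GB at 4 bits.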