BDQ achieves theory-optimal LLM quantization with <1% accuracy loss

arXiv cs.LG, May 20, 2026

⚡ New Flatness-guided method quantizes to 4 bits for roughly 4x smaller weights and narrows the low-bit performance gap by 39.1%.

Deep Dive

Post-training quantization is critical for deploying large language models on resource-constrained hardware, but activation outliers have long degraded accuracy at low bit widths. A new multi-institution paper by Xiusheng Huang, Zhe Li, and colleagues mathematically models the relationship between quantization error and outliers, then introduces a metric called Flatness to quantify how outliers are distributed. According to the authors, this theoretical framework yields an optimal solution for minimizing quantization loss.
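To see why outliers matter, here is a minimal sketch of plain symmetric round-to-nearest quantization at 4 bits. It is not the paper's error model or its Flatness metric; the function names, the synthetic data, and the outlier value of 50.0 are all assumptions chosen for illustration. It compares the reconstruction error of a well-behaved tensor against the same tensor with one large outlier.

```python
import random

def quantize_dequantize(values, bits=4):
    """Symmetric per-tensor round-to-nearest quantization, then dequantization."""
    qmax = 2 ** (bits - 1) - 1                 # 7 for signed 4-bit
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    out = []
    for v in values:
        q = max(-qmax - 1, min(qmax, round(v / scale)))  # clamp to the integer grid
        out.append(q * scale)
    return out

def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

rng = random.Random(0)
flat = [rng.uniform(-1.0, 1.0) for _ in range(1024)]  # no outliers
spiky = list(flat)
spiky[0] = 50.0                                       # one hypothetical outlier

err_flat = mse(flat, quantize_dequantize(flat))
err_spiky = mse(spiky, quantize_dequantize(spiky))
print(f"MSE without outlier: {err_flat:.5f}")
print(f"MSE with outlier:    {err_spiky:.5f}")
```

A single outlier stretches the quantization step so far that nearly every ordinary value rounds to zero, which is the failure mode outlier-aware methods try to avoid.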

Building on this theory, the team presents Bidirectional Diagonal Quantization (BDQ), a framework that uses learned diagonal matrices to disperse outlier magnitudes across weight and activation dimensions. The reported results set a new state of the art: on LLaMA-3-8B at aggressive W4A4 quantization (4-bit weights and activations), accuracy drops by less than 1%. In the even more challenging W2A4KV16 setting (2-bit weights, 4-bit activations, 16-bit KV cache) on the 70B-parameter DeepSeek-R1-Distill-LLaMA-70B model, BDQ narrows the performance gap by 39.1% compared with prior state-of-the-art approaches. The work offers both a theoretical foundation and a practical tool for extreme LLM compression.
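The general idea behind diagonal rescaling can be sketched in a few lines. This is not BDQ itself: the summary does not give the paper's formulation, and BDQ learns its diagonal matrices, whereas the sketch below assumes a fixed square-root balancing heuristic applied on one side only. The matrix sizes, the outlier channel, and all names are hypothetical. The point it shows is that dividing activations by a diagonal matrix and multiplying weights by the same matrix leaves the layer output unchanged while shrinking the activation range.

```python
import random

def matmul(a, b):
    """Plain nested-list matrix product of a (n x d) and b (d x m)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def max_abs(matrix):
    """Largest absolute entry of a nested-list matrix."""
    return max(abs(v) for row in matrix for v in row)

rng = random.Random(0)
n, d, m = 16, 8, 4
X = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n)]  # activations
W = [[rng.uniform(-1, 1) for _ in range(m)] for _ in range(d)]  # weights
for row in X:
    row[0] *= 50.0  # assumption: channel 0 carries the activation outliers

# Heuristic per-channel scale (an assumption, not learned as in BDQ):
# balance the activation range against the weight range of each channel.
s = []
for j in range(d):
    x_max = max(abs(X[i][j]) for i in range(n))
    w_max = max(abs(v) for v in W[j])
    s.append((x_max / w_max) ** 0.5)

X_s = [[X[i][j] / s[j] for j in range(d)] for i in range(n)]  # X D^-1
W_s = [[W[j][k] * s[j] for k in range(m)] for j in range(d)]  # D W

Y = matmul(X, W)
Y_s = matmul(X_s, W_s)
max_diff = max(abs(Y[i][k] - Y_s[i][k]) for i in range(n) for k in range(m))

print(f"max |XW - (X D^-1)(D W)|: {max_diff:.2e}")
print(f"activation range before: {max_abs(X):.2f}, after: {max_abs(X_s):.2f}")
```

Note that the outlier magnitude does not disappear; part of it moves into the weights. How to split it across both sides is the design question that a learned approach such as BDQ is meant to answer.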

Key Points

BDQ introduces Flatness, a new metric that quantifies outlier distribution to guide theory-optimal quantization.
On LLaMA-3-8B, BDQ achieves less than 1% accuracy loss at W4A4 (4-bit weights and activations).
On DeepSeek-R1-Distill-LLaMA-70B at W2A4KV16, BDQ reduces the performance gap by 39.1% over existing methods.

Why It Matters

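Could make 70B-scale models more practical on memory-constrained hardware: 4-bit weights take roughly a quarter of the memory of 16-bit weights, and the reported accuracy loss at W4A4 on LLaMA-3-8B is under 1%.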
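As a rough check on the memory figure, the arithmetic below estimates weight storage for a 70B-parameter model at different bit widths. It assumes a 16-bit baseline, which the article does not state explicitly, and it ignores quantization scales, activations, and the KV cache, so real footprints will be somewhat larger. The helper name is hypothetical.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

params_70b = 70e9
fp16_gb = weight_memory_gb(params_70b, 16)  # assumed 16-bit baseline
w4_gb = weight_memory_gb(params_70b, 4)     # 4-bit weights
w2_gb = weight_memory_gb(params_70b, 2)     # 2-bit weights, as in W2A4KV16

print(f"16-bit: {fp16_gb:.1f} GB, 4-bit: {w4_gb:.1f} GB, 2-bit: {w2_gb:.1f} GB")
print(f"16-bit to 4-bit reduction: {fp16_gb / w4_gb:.0f}x")
```

Under these assumptions the weights shrink from about 140 GB to about 35 GB at 4 bits, and to about 17.5 GB at 2 bits.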
