Research & Papers

EdgeRazor: A Lightweight Framework for Large Language Models via Mixed-Precision Quantization-Aware Distillation

LLMs run on edge devices at under 2-bit precision, surpassing all 3-bit methods and beating the best 2-bit PTQ by 11.3 points

Deep Dive

A new paper on arXiv introduces EdgeRazor, a framework that pushes LLM compression to extreme low-bit widths without sacrificing accuracy. The method combines three modules: Mixed-Precision Quantization-Aware Distillation for fine-grained precision control; Adaptive Feature Distillation, which distills an n-bit student from its 16-bit teacher at the feature level; and Entropy-Aware KL Divergence, which balances forward and reverse KL according to the entropy of the teacher's output distribution. EdgeRazor works on base, instruction-tuned, and multimodal LLMs.
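
The summary leaves the exact gate unspecified, but a natural reading of "balances forward and reverse KL based on the teacher's output distribution" is to interpolate the two divergences using the teacher's normalized entropy. Below is a minimal PyTorch sketch under that assumption; the function name and the gating rule are illustrative, not the paper's.

```python
import math

import torch
import torch.nn.functional as F

def entropy_aware_kl(student_logits: torch.Tensor,
                     teacher_logits: torch.Tensor) -> torch.Tensor:
    """Per-token blend of forward and reverse KL, gated by teacher entropy.

    Assumption: where the teacher is uncertain (high entropy), weight the
    mode-covering forward KL; where it is confident, weight the
    mode-seeking reverse KL. EdgeRazor's actual gate may differ.
    """
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    s_logp = F.log_softmax(student_logits, dim=-1)
    t_p, s_p = t_logp.exp(), s_logp.exp()

    # Teacher entropy per token, normalized to [0, 1] by log(vocab_size).
    vocab_size = teacher_logits.size(-1)
    alpha = -(t_p * t_logp).sum(dim=-1) / math.log(vocab_size)

    fwd = (t_p * (t_logp - s_logp)).sum(dim=-1)  # KL(teacher || student)
    rev = (s_p * (s_logp - t_logp)).sum(dim=-1)  # KL(student || teacher)
    return (alpha * fwd + (1.0 - alpha) * rev).mean()
```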

Results are striking: at just 1.88-bit average precision, EdgeRazor surpasses all 3-bit methods and beats the best 2-bit PTQ approaches by 11.3 percentage points, while requiring 4-10x less training budget than state-of-the-art QAT. For example, Qwen3-0.6B compressed to 1.58-bit drops storage from 1.41GB to 0.28GB and achieves a 15.1x decoding speedup over 16-bit. The framework delivers higher compression ratios than prior methods across all bit widths, making it practical to run capable LLMs on phones, IoT devices, and other resource-constrained hardware.
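
Those storage figures pass a quick back-of-envelope check, assuming storage ≈ parameters × bits per weight / 8; the parameter count below is inferred from the reported 16-bit size rather than taken from the paper.

```python
# Sanity check of the reported sizes, not the paper's own accounting.
reported_fp16_gb = 1.41                  # Qwen3-0.6B at 16-bit, as reported
n_params = reported_fp16_gb * 1e9 / 2    # ~0.7B weights at 2 bytes each

bare_payload_gb = n_params * 1.58 / 8 / 1e9
print(f"{bare_payload_gb:.2f} GB")       # ~0.14 GB of raw 1.58-bit weights

# The reported 0.28 GB is roughly 2x the bare payload, consistent with
# quantization scales plus some tensors (e.g. embeddings) kept at
# higher precision.
```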

Key Points
  • At 1.88-bit average precision, EdgeRazor surpasses all 3-bit methods and beats the best 2-bit PTQ approaches by 11.3 points
  • Training budget is 4-10x lower than state-of-the-art QAT approaches, cutting the cost of producing deployable low-bit models
  • Qwen3-0.6B at 1.58-bit reduces storage from 1.41GB to 0.28GB and speeds decoding by 15.1x (see the quantizer sketch below)
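
The summary doesn't describe EdgeRazor's quantizer, but 1.58 bits is log2(3), the information content of ternary weights {-1, 0, +1}. For intuition, here is a generic absmean ternary quantizer in the style of BitNet b1.58; it is not EdgeRazor's method.

```python
import torch

def ternary_quantize(w: torch.Tensor, eps: float = 1e-8):
    """Absmean ternary quantization to {-1, 0, +1}, BitNet b1.58 style.

    Illustrative only: EdgeRazor's actual quantizer and its
    mixed-precision bit allocation are not specified in this summary.
    """
    scale = w.abs().mean().clamp(min=eps)   # one scale per tensor
    q = (w / scale).round().clamp(-1, 1)    # snap to nearest of {-1, 0, +1}
    return q, scale                         # dequantize as q * scale

w = torch.randn(4, 8)
q, s = ternary_quantize(w)
print(sorted(q.unique().tolist()))          # typically [-1.0, 0.0, 1.0]
```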

Why It Matters

EdgeRazor makes powerful LLMs deployable on edge devices, unlocking low-cost, private on-device AI.