Research & Papers

BitsMoE framework slashes MoE LLM memory with smarter quantization

arXiv cs.LG June 02, 2026

⚡ 12.3× faster quantization, a 27.8% accuracy jump, and 1.76× faster decoding than GPTQ at 2-bit.

Deep Dive

Running large Mixture-of-Experts (MoE) language models such as Qwen3-30B-A3B-Base is memory-intensive: every expert's weights must stay resident in memory, even though only a fraction of the experts are activated for any given token. Existing compression methods, namely pruning and coarse quantization, struggle at ultra-low bitwidths. Pruning removes model capacity permanently, and uniform quantization ignores the fact that experts and weight directions differ in importance.

BitsMoE, introduced by Jiayu Zhao and colleagues, tackles this with a spectral-energy-guided approach. It first applies singular value decomposition (SVD) to each MoE layer to separate a shared basis, which is kept in full precision, from expert-specific spectral factors, which serve as fine-grained quantization units. It then formulates mixed-precision bit allocation as an integer linear program that minimizes estimated reconstruction loss under a fixed bit budget. Compared to GPTQ at 2-bit precision on Qwen3-30B-A3B-Base, BitsMoE quantizes 12.3× faster, improves accuracy by 27.83 percentage points, and decodes 1.76× faster. The method is open-source and available on GitHub.
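To make the decomposition step concrete, here is a minimal NumPy sketch of one way a shared basis and per-expert factors could be obtained with SVD. It is an illustration under stated assumptions, not the BitsMoE implementation: stacking the expert matrices side by side, the function name `shared_basis_decompose`, and the toy shapes are all hypothetical choices made for this example.

```python
import numpy as np

def shared_basis_decompose(experts, rank):
    """Split expert weights into one shared basis plus per-expert factors.

    Assumption for illustration only: the shared basis is taken as the top
    left singular vectors of the experts concatenated along the input axis.
    """
    stacked = np.concatenate(experts, axis=1)            # (d_out, E * d_in)
    U, S, _ = np.linalg.svd(stacked, full_matrices=False)
    basis = U[:, :rank]                                   # shared, kept in full precision
    factors = [basis.T @ W for W in experts]              # expert-specific, (rank, d_in) each
    energy = S[:rank] ** 2                                # spectral energy per direction
    return basis, factors, energy

# Toy layer: 4 experts, each an 8 x 16 weight matrix.
rng = np.random.default_rng(0)
experts = [rng.standard_normal((8, 16)) for _ in range(4)]

U, factors, energy = shared_basis_decompose(experts, rank=8)

# With rank equal to d_out the basis is a full orthogonal matrix,
# so basis @ factor reproduces each expert up to rounding error.
recon_err = max(np.abs(U @ C - W).max() for C, W in zip(factors, experts))
```

In this sketch only the `factors` would be quantized, while `basis` stays in full precision; the `energy` values give one plausible signal of which directions deserve more bits.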

Key Points

BitsMoE uses SVD to decompose MoE layers into a shared unquantized basis and expert-specific spectral factors.
Bit allocation is solved via integer linear programming to minimize reconstruction loss under a fixed budget.
At 2-bit on Qwen3-30B-A3B-Base: 12.3× faster quantization, +27.83% accuracy, 1.76× faster decoding vs GPTQ.
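
The bit-allocation idea can be illustrated in miniature. The sketch below poses the same kind of problem (pick one bitwidth per quantization unit, minimize an estimated loss, stay within a bit budget) and solves it by exhaustive search instead of an ILP solver, which is only feasible at toy scale. The loss proxy `energy * 4**(-bits)`, the energy values, and the function name `allocate_bits` are assumptions made for this example and do not come from the BitsMoE paper.

```python
from itertools import product

def allocate_bits(energies, bit_choices, budget):
    """Exhaustively search for the integer bit assignment with the lowest
    estimated loss whose total bit count stays within the budget.

    Assumed loss proxy: each extra bit cuts a unit's quantization error
    by a factor of 4, scaled by that unit's spectral energy.
    """
    best_bits, best_loss = None, float("inf")
    for bits in product(bit_choices, repeat=len(energies)):
        if sum(bits) > budget:
            continue  # violates the fixed bit budget
        loss = sum(e * 4.0 ** (-b) for e, b in zip(energies, bits))
        if loss < best_loss:
            best_bits, best_loss = bits, loss
    return best_bits, best_loss

# Hypothetical spectral energies for four equally sized quantization units.
energies = [10.0, 3.0, 1.0, 0.2]
budget = 2 * len(energies)  # an average of 2 bits per unit

best_bits, best_loss = allocate_bits(energies, (1, 2, 3, 4), budget)

# Baseline: give every unit exactly 2 bits.
uniform_loss = sum(e * 4.0 ** (-2) for e in energies)
```

Under these assumptions the search moves bits from the lowest-energy unit to the highest-energy one, and the mixed-precision assignment ends up with a lower estimated loss than uniform 2-bit at the same total budget. At realistic scale the same objective and constraints would be handed to an ILP solver instead of being enumerated.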

Why It Matters

By cutting memory use at ultra-low bitwidths, BitsMoE could enable deploying massive MoE models on edge devices with minimal accuracy loss and faster inference.
