Research & Papers

BitsMoE framework slashes MoE LLM memory with smarter quantization

12.3× faster quantization, a 27.83-percentage-point accuracy gain, and 1.76× faster decoding than GPTQ at 2-bit.

Deep Dive

Running large Mixture-of-Experts (MoE) language models like Qwen3-30B-A3B-Base is memory-intensive because all expert weights must stay resident in memory, even though only a fraction of them are activated for any given token. Existing compression methods (pruning or coarse quantization) struggle at ultra-low bitwidths: pruning removes model capacity permanently, and uniform quantization ignores the varying importance of different experts and weight directions.
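To put the memory pressure in rough numbers, here is a back-of-the-envelope sketch. The parameter counts are assumptions read off the model name (about 30B total, about 3B active per token), not figures from the paper, and the estimate covers weights only, ignoring quantization scales, any full-precision components, activations, and the KV cache.

```python
def weight_memory_gib(num_params, bits_per_weight):
    """Memory needed to hold `num_params` weights at the given bit width, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30

# Assumptions taken from the model name "30B-A3B", not from the paper.
TOTAL_PARAMS = 30e9   # all experts must be resident
ACTIVE_PARAMS = 3e9   # roughly what a single token actually uses

for bits in (16, 4, 2):
    total = weight_memory_gib(TOTAL_PARAMS, bits)
    active = weight_memory_gib(ACTIVE_PARAMS, bits)
    print(f"{bits:>2}-bit: {total:6.2f} GiB resident, {active:5.2f} GiB touched per token")
```

Under these assumptions, the resident footprint is about ten times what a single token touches at any bit width, which is why compressing the experts themselves matters more for MoE models than for dense ones.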

BitsMoE, introduced by Jiayu Zhao and colleagues, tackles this with a spectral-energy-guided approach. It first uses singular value decomposition (SVD) on each MoE layer to separate a shared basis (kept in full precision) from expert-specific spectral factors (used as fine-grained quantization units). Then, it formulates mixed-precision bit allocation as an integer linear program that minimizes estimated reconstruction loss under a fixed bit budget. At 2-bit precision on Qwen3-30B-A3B-Base, BitsMoE achieves 12.3× faster quantization, a 27.83 percentage point accuracy improvement, and 1.76× faster decoding compared to GPTQ. The method is open-source and available on GitHub.
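The decomposition step can be illustrated with a minimal NumPy sketch. This is one plausible reading of "shared basis plus expert-specific spectral factors", not the authors' implementation: it assumes the experts of a layer are concatenated column-wise before the SVD, and the names `decompose_experts`, `fake_quantize_rows`, and `reconstruct` are hypothetical. The quantizer is a plain min-max uniform quantizer applied per row, chosen only for illustration.

```python
import numpy as np

def decompose_experts(weights):
    """Split expert matrices (each d_out x d_in) into a shared basis U and
    per-expert factors F_e with W_e = U @ F_e. Assumed layout, for illustration."""
    d_in = weights[0].shape[1]
    stacked = np.concatenate(weights, axis=1)            # d_out x (E * d_in)
    u, s, vt = np.linalg.svd(stacked, full_matrices=False)
    coeffs = s[:, None] * vt                             # scale each direction by its singular value
    factors = [coeffs[:, e * d_in:(e + 1) * d_in] for e in range(len(weights))]
    energy = s ** 2 / np.sum(s ** 2)                     # share of spectral energy per direction
    return u, factors, energy

def fake_quantize_rows(x, bits):
    """Min-max uniform quantization per row, returned in dequantized form."""
    lo = x.min(axis=1, keepdims=True)
    hi = x.max(axis=1, keepdims=True)
    levels = 2 ** bits - 1
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)   # avoid zero scale on constant rows
    return np.round((x - lo) / scale) * scale + lo

def reconstruct(u, factors, bits):
    """Keep U in full precision and quantize only the expert-specific factors."""
    return [u @ fake_quantize_rows(f, bits) for f in factors]

rng = np.random.default_rng(0)
experts = [rng.standard_normal((8, 6)) for _ in range(4)]  # toy layer: 4 experts
basis, factors, energy = decompose_experts(experts)
for bits in (2, 4, 8):
    approx = reconstruct(basis, factors, bits)
    err = sum(np.linalg.norm(w - a) ** 2 for w, a in zip(experts, approx))
    print(f"{bits}-bit factors: squared reconstruction error = {err:.4f}")
```

Each row of a factor corresponds to one spectral direction, so rows are natural units for assigning different bit widths; the sketch uses a single width for all rows to stay short.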

Key Points
  • BitsMoE uses SVD to decompose MoE layers into a shared unquantized basis and expert-specific spectral factors.
  • Bit allocation is solved via integer linear programming to minimize reconstruction loss under a fixed budget.
  • At 2-bit on Qwen3-30B-A3B-Base: 12.3× faster quantization, +27.83 percentage points accuracy, 1.76× faster decoding vs GPTQ.
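The bit-allocation idea can be sketched as a small integer program: each quantization unit picks exactly one bit width, the total bit cost must fit the budget, and the summed loss estimate is minimized. The sketch below is not the BitsMoE solver. It swaps in an exact dynamic program over the budget instead of an ILP solver so that it runs with the standard library alone, and it assumes a toy loss model in which a unit's loss is its spectral energy times 4^(-bits), following the rule of thumb that uniform quantization noise drops about fourfold per extra bit. The function name `allocate_bits` and the example numbers are hypothetical.

```python
def allocate_bits(energies, sizes, budget, choices=(1, 2, 3, 4)):
    """Choose one bit width per unit to minimize total estimated loss
    subject to sum(size_i * bits_i) <= budget. Exact, via dynamic programming."""
    # best maps "bits spent so far" -> (lowest loss reaching that spend, allocation)
    best = {0: (0.0, [])}
    for energy, size in zip(energies, sizes):
        nxt = {}
        for spent, (loss, alloc) in best.items():
            for b in choices:
                cost = spent + size * b
                if cost > budget:
                    continue
                cand = loss + energy * 4.0 ** (-b)   # assumed loss model
                if cost not in nxt or cand < nxt[cost][0]:
                    nxt[cost] = (cand, alloc + [b])
        best = nxt
    if not best:
        raise ValueError("budget too small for the lowest bit width")
    return min(best.values(), key=lambda item: item[0])

# Four equally sized units with decreasing spectral energy, 2 bits per unit on average.
energies = [8.0, 4.0, 1.5, 0.5]
sizes = [1, 1, 1, 1]
loss, bits = allocate_bits(energies, sizes, budget=8)
uniform_loss = sum(e * 4.0 ** (-2) for e in energies)
print("allocation:", bits, "loss:", loss, "uniform 2-bit loss:", uniform_loss)
```

In this toy instance the most energetic unit receives 3 bits and the least energetic only 1, and the estimated loss falls below that of a uniform 2-bit assignment at the same budget. That is the intuition behind spending bits where the spectral energy is.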

Why It Matters

BitsMoE could make it practical to deploy massive MoE models on memory-constrained hardware such as edge devices, with less accuracy loss at very low bitwidths and faster inference than GPTQ.