LiftQuant: Continuous Bit-Width Quantization Fits 70B LLM to 24GB GPU

The method enables fractional bit-widths, compressing a 70B model to 2.4 bits per weight while outperforming 2-bit baselines on the same GPU.

Deep Dive

Traditional quantization methods are limited to rigid integer bit-widths (e.g., 2 or 3 bits), creating a 'deployment gap' in which a large language model cannot be sized to the memory actually available. LiftQuant, accepted as an ICML 2026 Spotlight paper, addresses this with a continuous bit-width framework. Its core innovation is a 'lift-then-project' mechanism: each weight vector is approximated by projecting a simple 1-bit lattice down from a higher-dimensional 'lifted' space. The effective bit-width is then the ratio of the lifted dimension to the original dimension, which allows quasi-continuous tuning. The result is a structured, non-uniform codebook akin to vector quantization (VQ), yet the method remains hardware-friendly because it relies solely on linear transformations and uniform quantizers.
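The lift-then-project idea can be illustrated with a toy sketch. This is not the authors' implementation: the random `projection` matrix, the brute-force `encode` search, and the dimensions `d = 5` and `m = 12` are all assumptions chosen for illustration, and the source does not say how the real projection is obtained or how encoding is performed. The sketch only shows how 1-bit codes in a lifted space of dimension `m`, projected linearly into `d` dimensions, yield a non-uniform codebook at `m / d` bits per weight.

```python
import itertools

import numpy as np


def build_codebook(projection: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Enumerate every 1-bit code in the lifted space and project it down.

    projection has shape (d, m): it maps m lifted dims to d original dims.
    Returns (codes, codebook) with shapes (2**m, m) and (2**m, d).
    """
    _, m = projection.shape
    # All sign patterns in {-1, +1}^m: the "simple 1-bit lattice".
    codes = np.array(list(itertools.product((-1.0, 1.0), repeat=m)))
    # A single linear map turns the uniform lattice into a non-uniform codebook.
    codebook = codes @ projection.T
    return codes, codebook


def encode(w: np.ndarray, codes: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Brute-force nearest codeword (assumption: only feasible for tiny m)."""
    idx = int(np.argmin(np.sum((codebook - w) ** 2, axis=1)))
    return codes[idx]


def decode(code: np.ndarray, projection: np.ndarray) -> np.ndarray:
    """Decoding is one linear transformation of the stored 1-bit code."""
    return projection @ code


rng = np.random.default_rng(0)
d, m = 5, 12  # hypothetical sizes: 12 stored bits for every 5 weights
# Assumption: a random projection stands in for whatever the method actually uses.
projection = rng.standard_normal((d, m)) / np.sqrt(m)

codes, codebook = build_codebook(projection)
w = rng.standard_normal(d)          # a toy weight vector
code = encode(w, codes, codebook)   # m values in {-1, +1}, i.e. m bits
w_hat = decode(code, projection)    # reconstruction in the original space

bits_per_weight = m / d
print(f"bits per weight: {bits_per_weight}")
print(f"squared error:   {np.sum((w - w_hat) ** 2):.4f}")
```

Changing `m` while holding `d` fixed moves the bit-width in steps of `1 / d`, which is the sense in which the ratio can be tuned almost continuously.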

In practice, a 70B model can be compressed to 2.4 bits per weight to fit a 24GB GPU, where it significantly outperforms state-of-the-art 2-bit models on the same device. This flexibility lets practitioners tune the memory-performance trade-off to a given GPU budget, so large LLMs can be deployed on consumer hardware with less quality loss than rounding down to an integer bit-width would impose. The authors have released code and checkpoints, so researchers and engineers can try the approach directly.
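The memory arithmetic behind the 2.4-bit figure can be checked with a short sketch. It counts weight storage only and uses decimal gigabytes (1 GB = 1e9 bytes); the helper names and the 3 GB reserve for activations, KV cache, and quantization metadata are hypothetical values for illustration, not numbers from the paper.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Storage for the quantized weights alone, in decimal GB (1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


def max_bits_per_weight(num_params: float, budget_gb: float,
                        reserve_gb: float = 0.0) -> float:
    """Largest average bit-width whose weights fit in budget_gb minus reserve_gb."""
    return (budget_gb - reserve_gb) * 1e9 * 8 / num_params


NUM_PARAMS = 70e9  # 70B parameters

for bits in (2.0, 2.4, 3.0):
    print(f"{bits:.1f} bits/weight -> {weight_memory_gb(NUM_PARAMS, bits):.2f} GB")

# Assumption: 3 GB of a 24 GB card is held back for everything except weights.
print(f"max bits/weight: {max_bits_per_weight(NUM_PARAMS, 24.0, reserve_gb=3.0):.2f}")
```

Under these assumptions, 2-bit weights take 17.5 GB and leave memory unused, 3-bit weights take 26.25 GB and do not fit, and 2.4 bits lands at 21 GB, which is the gap a fractional bit-width is meant to fill.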

Key Points
  • LiftQuant uses a 'lift-then-project' mechanism to achieve quasi-continuous bit-widths (e.g., 2.4 bits per weight) by varying the ratio of the lifted dimension to the original dimension.
  • A 70B-parameter LLM can be compressed to fit a 24GB GPU while outperforming leading 2-bit quantization methods on the same device.
  • The method is hardware-friendly: decoding uses only linear transformations and 1-bit uniform quantizers, avoiding complex non-linear operations.

Why It Matters

Fractional bit-widths let a model be sized to the memory actually available, so large LLMs can run on consumer GPUs without being forced down to a coarser integer bit-width.