MoBiQuant: Mixture-of-Bits Quantization for Token-Adaptive Elastic LLMs
New technique adjusts precision per token, matching specialized models with a single calibration.
A research team from Duke University, Samsung, Carnegie Mellon, and other institutions has introduced MoBiQuant, a quantization framework that lets large language models dynamically adjust their computational precision at runtime based on individual token complexity. This addresses a critical challenge in elastic LLM deployment: the traditional need to maintain separate calibration parameters for each precision level (e.g., 4-bit, 8-bit), which complicates resource-adaptive inference on edge devices and cloud infrastructure. The researchers traced this calibration instability to varying token-level sensitivity driven by precision-dependent outlier migration: which activations behave as outliers shifts with bit-width, so calibration tuned at one precision degrades at another. Their solution allows a single model to operate efficiently across different hardware constraints without performance degradation.
The technical innovation centers on two key components: a many-in-one recursive residual quantization method that can iteratively reconstruct higher-precision weights from lower-bit representations, and a token-aware router that dynamically selects the optimal number of residual bit slices for each input token. Experimental validation on the LLaMA3-8B model showed that MoBiQuant matches the performance of precision-specific post-training quantization (PTQ) methods while requiring only one calibration pass. This means developers can deploy a single model artifact that automatically scales its computational footprint—using lower precision for simpler tokens and higher precision for complex ones—enabling more efficient inference across heterogeneous environments from smartphones to data centers.
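The "reconstruct higher precision from lower-bit residual slices" idea can be sketched roughly as follows. This is an illustrative toy, not the paper's implementation: the uniform symmetric quantizer, the bit-widths, and the function names (`quantize`, `residual_quantize`, `reconstruct`) are all assumptions made for the example.

```python
import numpy as np

def quantize(x, bits):
    """Uniform symmetric quantization; returns dequantized values.
    (Illustrative stand-in for whatever quantizer MoBiQuant uses.)"""
    qmax = 2 ** (bits - 1) - 1
    peak = np.max(np.abs(x))
    scale = peak / qmax if peak > 0 else 1.0
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q * scale

def residual_quantize(w, base_bits=4, num_slices=4, slice_bits=2):
    """Sketch of recursive residual quantization: a low-bit base
    representation plus a stack of residual 'bit slices', each
    quantizing the error left by everything before it."""
    base = quantize(w, base_bits)
    slices = [base]
    residual = w - base
    for _ in range(num_slices):
        r_hat = quantize(residual, slice_bits)
        slices.append(r_hat)
        residual = residual - r_hat  # error shrinks each round
    return slices

def reconstruct(slices, k):
    """Rebuild weights from the base slice plus the first k residual
    slices; larger k means higher effective precision."""
    return sum(slices[: 1 + k])
```

The point of the sketch is the elastic property: one stored artifact (the slice stack) yields a whole family of precisions, since dropping trailing slices gracefully degrades rather than breaks the reconstruction.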
- Enables elastic LLM inference with dynamic precision switching (4-bit to 8-bit) based on token complexity
- Eliminates the need for repeated calibration, matching bit-specific PTQ performance on LLaMA3-8B with a single calibration pass
- Uses token-aware routing and recursive residual quantization to handle precision-dependent outlier migration
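In spirit, a token-aware router maps a per-token complexity signal to a number of residual slices. The minimal sketch below is an assumption for illustration only: using the activation L2 norm as the complexity score and fixed thresholds is not the paper's routing policy, and `token_router` is a hypothetical name.

```python
import numpy as np

def token_router(hidden_states, thresholds=(0.5, 1.0, 1.5)):
    """Hypothetical router: score each token's complexity (here, its
    activation L2 norm relative to the batch mean) and map the score
    to a number of extra residual bit slices (0 = base precision only).
    Tokens with larger, more outlier-prone activations get more slices."""
    norms = np.linalg.norm(hidden_states, axis=-1)
    score = norms / norms.mean()          # relative complexity per token
    return np.digitize(score, thresholds)  # 0..len(thresholds) slices
```

A router like this is what makes the precision switching dynamic at inference time: simple tokens route to the cheap base path, while hard tokens pull in additional residual slices.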
Why It Matters
Enables efficient AI deployment across devices by allowing models to dynamically adapt computational needs without sacrificing accuracy.