New study reveals KV cache quantization silently destroys LLM safety alignment
Low-bit quantization can erase safety features, even when perplexity looks fine.
A new preprint from Bruce Changlong Xu, Adarsh Kumarappan, and Mu Zhou reports a critical vulnerability in LLM deployment: low-bit KV cache quantization, a common technique for reducing memory usage, can silently collapse safety alignment. Testing 11 instruction-tuned models (3.8B to 72B parameters) across five benchmarks (1,894 prompts), the authors found that standard metrics such as perplexity fail to detect the collapse. For example, Mistral-7B loses 15.2% of its safety refusals at only a 1.03x perplexity increase, a change invisible to typical evaluations. No universal safe bit-width exists; each model exhibits its own sharp failure threshold.
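To make "low-bit KV cache quantization" concrete, here is a minimal sketch of simulated round-to-nearest quantization applied to a synthetic key tensor. It is not the preprint's code or any serving framework's kernel: the function name `fake_quantize`, the per-token scaling choice, and the random data are all assumptions made for illustration. It only shows how reconstruction error grows as the bit-width shrinks.

```python
import numpy as np

def fake_quantize(x: np.ndarray, bits: int, axis: int = -1) -> np.ndarray:
    """Asymmetric round-to-nearest quantization followed by dequantization.

    Hypothetical helper: each slice along `axis` gets its own scale and offset.
    """
    levels = 2 ** bits - 1
    lo = x.min(axis=axis, keepdims=True)
    hi = x.max(axis=axis, keepdims=True)
    # Guard against a zero range so the division below is always defined.
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)
    codes = np.clip(np.round((x - lo) / scale), 0, levels)
    return codes * scale + lo

# Synthetic stand-in for cached keys with shape (tokens, head_dim).
rng = np.random.default_rng(0)
keys = rng.standard_normal((128, 64)).astype(np.float32)

errors = {}
for bits in (8, 4, 2):
    dequant = fake_quantize(keys, bits)
    errors[bits] = float(np.linalg.norm(keys - dequant) / np.linalg.norm(keys))
    print(f"{bits}-bit relative reconstruction error: {errors[bits]:.4f}")
```

Reconstruction error of this kind is an aggregate measure, like perplexity. The study's point is that such aggregates can stay small while safety behavior still degrades.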
The team identifies the root cause geometrically: safety features reside in a low-dimensional activation subspace that is 10^2-10^3x more sensitive to quantization noise than the full representation space. They propose Per-Channel Reduction (PCR), a diagnostic that classifies models into three failure modes (outlier-crushes-safety, outlier-as-safety, multi-layer dilution) using only 20 calibration prompts. PCR predicts the correct mitigation direction across all primary models and a held-out model from another family. The resulting training-free protocol takes about 35 GPU minutes and recovers up to 97.2% of lost alignment (e.g., with KIVI), outperforming attention-based methods. Vulnerabilities were confirmed in production vLLM serving with FP8 KV cache on NVIDIA GPUs.
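The subspace claim can be illustrated with a toy measurement. The sketch below is not PCR and does not reproduce the preprint's 10^2-10^3x figure. It assumes synthetic activations with a few hypothetical large-magnitude "outlier" channels, a stand-in "feature subspace" made of low-magnitude channels, and per-token 4-bit quantization. Under those assumptions it compares relative quantization noise in the full space with relative noise inside the subspace.

```python
import numpy as np

def fake_quantize(x: np.ndarray, bits: int) -> np.ndarray:
    """Per-token (row-wise) asymmetric round-to-nearest quantize/dequantize."""
    levels = 2 ** bits - 1
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)
    return np.clip(np.round((x - lo) / scale), 0, levels) * scale + lo

def relative_noise(x, x_q, basis=None) -> float:
    """Noise energy over signal energy, optionally after projecting onto `basis`."""
    err = x - x_q
    if basis is not None:
        x, err = x @ basis, err @ basis
    return float(np.sum(err ** 2) / np.sum(x ** 2))

rng = np.random.default_rng(0)
tokens, dim, k = 256, 128, 4

# Assumption: 8 outlier channels with 20x the standard deviation of the rest.
channel_std = np.ones(dim)
channel_std[:8] = 20.0
acts = rng.standard_normal((tokens, dim)) * channel_std

# Assumption: the "feature subspace" is spanned by k low-magnitude channels.
basis = np.zeros((dim, k))
basis[np.arange(dim - k, dim), np.arange(k)] = 1.0  # orthonormal columns

acts_q = fake_quantize(acts, bits=4)
full = relative_noise(acts, acts_q)
sub = relative_noise(acts, acts_q, basis)
print(f"full-space relative noise: {full:.3e}")
print(f"subspace relative noise:   {sub:.3e}")
print(f"sensitivity ratio:         {sub / full:.1f}x")
```

In this toy setup the outlier channels set the quantization step for every token, so the low-magnitude directions absorb noise that is large relative to their own signal. Whether a real model behaves this way depends on where its safety-relevant directions actually lie, which is what the paper's diagnostic is meant to determine.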
- Mistral-7B loses 15.2% of its safety refusals at only a 1.03x perplexity increase.
- Safety features occupy a low-dimensional subspace 10^2-10^3x more vulnerable to quantization noise.
- The PCR-guided protocol recovers up to 97.2% of lost alignment in about 35 GPU minutes, with no training required.
Why It Matters
The memory savings from KV cache quantization come with hidden safety risks that perplexity does not reveal; PCR offers a training-free way to diagnose and mitigate them in production LLMs.