OSCAR RotationZoo compresses KV cache 7x with 2-bit quantization
Drop-in rotation files let you run 30B+ models on 8GB VRAM.
OSCAR RotationZoo is a new tool from the FutureMLS Lab that enables extreme KV cache compression in large language models using 2-bit quantization. By capturing Q/K/V activations on a small calibration set and estimating attention-aware covariance offline, the method derives per-layer orthogonal rotations that preserve the directions attention actually uses. The result is a ~7x reduction in KV cache memory footprint with minimal accuracy loss: a reported single-digit drop in GPQA score for dense reasoning models such as Qwen3-4B, Qwen3-8B, Qwen3-32B, and GLM-4.7B.
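To make the idea concrete, here is a minimal NumPy sketch of the general rotate-then-quantize pattern. It is not the OSCAR implementation: the plain covariance of the calibration keys stands in for the attention-aware covariance, the group size of 32 is an arbitrary choice, and every function name is hypothetical. What it does show is why an orthogonal rotation is safe to apply: rotating queries and keys by the same matrix leaves the attention scores unchanged, so only the quantization step introduces error.

```python
import numpy as np

def rotation_from_calibration(keys: np.ndarray) -> np.ndarray:
    """Orthogonal rotation built from the eigenvectors of the key covariance.

    Assumption: plain covariance is used here as a stand-in for the
    attention-aware covariance described in the article.
    """
    centered = keys - keys.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / len(keys)
    _, eigvecs = np.linalg.eigh(cov)  # columns are orthonormal eigenvectors
    return eigvecs

def quantize_2bit(x: np.ndarray, group: int = 32):
    """Asymmetric 2-bit quantization (codes 0..3) with per-group scale and offset."""
    g = x.reshape(-1, group)
    lo = g.min(axis=1, keepdims=True)
    hi = g.max(axis=1, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / 3.0, 1.0)  # avoid division by zero
    codes = np.clip(np.round((g - lo) / scale), 0, 3).astype(np.uint8)
    return codes, scale, lo

def dequantize(codes, scale, lo, shape):
    """Map 2-bit codes back to approximate floating-point values."""
    return (codes * scale + lo).reshape(shape)

rng = np.random.default_rng(0)
K = rng.normal(size=(256, 64))  # synthetic calibration keys: (tokens, head_dim)
Q = rng.normal(size=(8, 64))    # synthetic queries

R = rotation_from_calibration(K)
K_rot, Q_rot = K @ R, Q @ R

# Orthogonality means Q_rot @ K_rot.T equals Q @ K.T before quantization.
scores_match = np.allclose(Q_rot @ K_rot.T, Q @ K.T)

codes, scale, lo = quantize_2bit(K_rot)
K_hat = dequantize(codes, scale, lo, K_rot.shape)
```

In this sketch the eigendecomposition is the offline step. A precomputed rotation file would replace the call to `rotation_from_calibration`, leaving only the matrix multiply and the quantizer at inference time.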
The precomputed rotation matrices are provided as drop-in .pt files, eliminating the need to re-run the Q/K/V dump and eigendecomposition. This makes the method plug-and-play for researchers and practitioners who want to run medium-sized models (30-40B MoE or 10-20B dense) on consumer GPUs with 8GB of VRAM. The release is generating buzz in the open-source community, where users hope to see it integrated into llama.cpp for broader accessibility.
- Achieves ~7x KV cache compression using INT2 quantization, with a reported single-digit drop in GPQA score.
- Precomputed rotation matrices available for Qwen3-4B, 8B, 32B, and GLM-4.7B models.
- Drop-in .pt files eliminate the need to re-run expensive eigendecomposition; community hopes for llama.cpp integration.
Why It Matters
Running 30-40B MoE models on 8GB of VRAM would put large-model inference within reach of consumer GPUs.
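The ~7x figure is consistent with simple arithmetic, sketched below. Going from 16-bit to 2-bit values gives 8x at best, and the per-group scale and offset that quantizers typically store pull that down. The model shape, context length, group size of 128, and 32 bits of metadata per group are all illustrative assumptions, not numbers from the release, and the helper `kv_cache_bytes` is hypothetical. The calculation covers the KV cache only; model weights need their own memory.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens,
                   bits_per_value, group_size=None, meta_bits=32):
    """Approximate KV cache size in bytes.

    group_size / meta_bits model the per-group scale and offset stored
    alongside the quantized values (assumed: two 16-bit numbers per group).
    """
    values = 2 * layers * kv_heads * head_dim * tokens  # keys and values
    bits = values * bits_per_value
    if group_size is not None:
        bits += (values // group_size) * meta_bits
    return bits / 8

# Illustrative shape (assumed, not taken from a specific model card).
shape = dict(layers=36, kv_heads=8, head_dim=128, tokens=32_768)

fp16 = kv_cache_bytes(**shape, bits_per_value=16)
int2 = kv_cache_bytes(**shape, bits_per_value=2, group_size=128)

ratio = fp16 / int2
print(f"FP16: {fp16 / 2**30:.2f} GiB, INT2: {int2 / 2**30:.2f} GiB, ratio: {ratio:.2f}x")
```

Under these assumptions each stored value costs 2.25 bits on average, so the ratio is 16 / 2.25, or about 7.1x. A smaller group size would lower the ratio, and a larger one would push it closer to 8x.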