Open Source

OSCAR RotationZoo compresses KV cache 7x with 2-bit quantization

Drop-in rotation files let you run 30B+ models on 8GB VRAM.

Deep Dive

OSCAR RotationZoo is a new tool from the FutureMLS Lab that enables extreme KV cache compression in large language models using 2-bit quantization. The method captures Q/K/V activations on a small calibration set, estimates an attention-aware covariance offline, and from it derives per-layer orthogonal rotations that preserve the directions attention actually uses. The reported result is a ~7x reduction in KV cache memory footprint with minimal accuracy loss: a single-digit drop in GPQA score for dense reasoning models such as Qwen3-4B, Qwen3-8B, Qwen3-32B, and GLM-4.7B.
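To make the pipeline described above concrete, here is a minimal NumPy sketch of the general idea, not the project's actual code. It assumes a plain second-moment matrix of key activations as a stand-in for the "attention-aware covariance" (the release may weight it differently), uses synthetic data in place of a real calibration set, and pairs the rotation with a simple per-group asymmetric 2-bit quantizer. All function names and the group size of 32 are hypothetical.

```python
import numpy as np

def estimate_rotation(acts: np.ndarray) -> np.ndarray:
    """Orthogonal matrix whose columns are eigenvectors of the second-moment
    matrix of `acts`, ordered by decreasing eigenvalue (assumed stand-in for
    the attention-aware covariance)."""
    cov = acts.T @ acts / acts.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)      # symmetric input, ascending eigenvalues
    order = np.argsort(eigvals)[::-1]
    return eigvecs[:, order]

def quantize_int2(x: np.ndarray, group_size: int = 32):
    """Asymmetric 2-bit quantization: 4 levels per group, with a per-group
    minimum and step size kept in floating point."""
    g = x.reshape(-1, group_size)
    lo = g.min(axis=1, keepdims=True)
    hi = g.max(axis=1, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / 3.0, 1.0)   # avoid divide-by-zero on flat groups
    codes = np.clip(np.round((g - lo) / scale), 0, 3).astype(np.uint8)
    return codes, scale, lo

def dequantize_int2(codes, scale, lo, shape):
    return (codes * scale + lo).reshape(shape)

rng = np.random.default_rng(0)
head_dim = 64

# Synthetic, correlated "key" activations and a few "query" vectors.
mixing = rng.normal(size=(head_dim, head_dim))
K = rng.normal(size=(512, head_dim)) @ mixing
Q = rng.normal(size=(16, head_dim))

R = estimate_rotation(K)        # computed once, offline
K_rot, Q_rot = K @ R, Q @ R     # rotate both sides of the attention dot product

# An orthogonal rotation leaves Q @ K.T unchanged before quantization.
orthogonality_error = float(np.abs(R.T @ R - np.eye(head_dim)).max())
score_error = float(np.abs(Q @ K.T - Q_rot @ K_rot.T).max())

# Only the rotated keys are stored, as 2-bit codes plus per-group metadata.
codes, scale, lo = quantize_int2(K_rot)
K_hat = dequantize_int2(codes, scale, lo, K_rot.shape)

print("orthogonality error:", orthogonality_error)
print("attention score error before quantization:", score_error)
```

Because the rotation is orthogonal, applying it to both queries and keys does not change attention scores by itself; any accuracy loss comes from the 2-bit rounding that follows, which is what a well-chosen rotation is meant to keep small.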

The precomputed rotation matrices are provided as drop-in .pt files, eliminating the need to re-run the Q/K/V dump and eigendecomposition. This makes the tool plug-and-play for researchers and practitioners who want to run medium-sized models (30-40B MoE or 10-20B dense) on consumer GPUs with 8GB VRAM. The release is generating buzz in the open-source community, where users hope to see it integrated into llama.cpp for broader accessibility.
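For a rough sense of why KV cache size matters on an 8GB card, the back-of-the-envelope calculation below estimates the cache footprint for one sequence. The layer count, KV head count, head dimension, and context length are illustrative assumptions for an 8B-class model with grouped-query attention, not values taken from the release; check a model's own config for real numbers. The ~7x divisor is the figure quoted above.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_len, bits_per_value):
    """Bytes needed to cache keys and values for a single sequence."""
    values = 2 * num_layers * num_kv_heads * head_dim * context_len  # 2 = keys + values
    return values * bits_per_value / 8

GIB = 1024 ** 3

# Hypothetical model shape and context length, for illustration only.
cfg = dict(num_layers=36, num_kv_heads=8, head_dim=128, context_len=32_768)

fp16_gib = kv_cache_bytes(**cfg, bits_per_value=16) / GIB
compressed_gib = fp16_gib / 7   # applying the reported ~7x reduction

print(f"FP16 KV cache: {fp16_gib:.2f} GiB")        # 4.50 GiB
print(f"At ~7x:        {compressed_gib:.2f} GiB")  # 0.64 GiB
```

Under these assumptions, the cache shrinks from more than half of an 8GB card to well under 1 GiB, leaving the remaining memory for weights and activations.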

Key Points
  • Achieves ~7x KV cache compression using INT2 quantization with a single-digit drop in GPQA score.
  • Precomputed rotation matrices available for Qwen3-4B, 8B, 32B, and GLM-4.7B models.
  • Drop-in .pt files eliminate the need to re-run expensive eigendecomposition; community hopes for llama.cpp integration.
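
The "~7x" figure, rather than a clean 8x, is what one would expect if 16-bit values are stored as 2-bit codes plus a small amount of per-group metadata. The sketch below works through that arithmetic; the group size of 128 and the 32 bits of metadata per group (for example an FP16 scale and an FP16 zero point) are assumptions chosen for illustration, not details confirmed by the release.

```python
def effective_bits(code_bits: float, group_size: int, metadata_bits: float) -> float:
    """Average bits stored per value once per-group metadata is amortized."""
    return code_bits + metadata_bits / group_size

def compression_ratio(baseline_bits, code_bits, group_size, metadata_bits):
    return baseline_bits / effective_bits(code_bits, group_size, metadata_bits)

# 16-bit baseline, 2-bit codes, no metadata: the theoretical ceiling.
ideal = compression_ratio(16, 2, group_size=1, metadata_bits=0)

# Assumed layout: groups of 128 values sharing 32 bits of scale/zero-point data.
with_overhead = compression_ratio(16, 2, group_size=128, metadata_bits=32)

print(f"ideal: {ideal:.2f}x")                  # 8.00x
print(f"with overhead: {with_overhead:.2f}x")  # 7.11x
```

Smaller groups improve fidelity but add metadata, pulling the ratio further below 8x.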

Why It Matters

Shrinking the KV cache makes it feasible to run 30-40B MoE or 10-20B dense models on consumer GPUs with 8GB VRAM, putting larger-model inference within reach of far more users.