Open Source

OSCAR RotationZoo compresses KV cache 7x with 2-bit quantization

r/LocalLLaMA May 25, 2026

⚡Drop-in rotation files let you run 30B+ models on 8GB VRAM.

Deep Dive

OSCAR RotationZoo is a new tool from the FutureMLS Lab that enables extreme KV cache compression in large language models using 2-bit (INT2) quantization. The method captures Q/K/V activations on a small calibration set, estimates an attention-aware covariance offline, and derives per-layer orthogonal rotations that preserve the directions attention actually uses. The result is a ~7x reduction in KV cache memory footprint with minimal accuracy loss: a single-digit perplexity drop on GPQA for dense reasoning models such as Qwen3-4B, Qwen3-8B, Qwen3-32B, and GLM-4.7B.
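The rotate-then-quantize idea can be illustrated with a small NumPy sketch. This is not OSCAR's code: it substitutes a plain covariance of synthetic key vectors for the attention-aware covariance, uses the eigenvectors of that covariance as the orthogonal rotation, and assumes a simple group-wise min/max INT2 scheme. All function names, the group size of 32, and the synthetic data are hypothetical.

```python
import numpy as np

def fit_rotation(calib):
    """Orthogonal matrix built from eigenvectors of the calibration covariance.
    Assumption: plain covariance stands in for an attention-aware one."""
    centered = calib - calib.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / calib.shape[0]
    _, eigvecs = np.linalg.eigh(cov)  # symmetric input -> orthonormal eigenvectors
    return eigvecs

def quantize_int2(x, group=32):
    """Asymmetric 2-bit quantization (codes 0..3) with one scale/offset per group."""
    groups = x.reshape(-1, group)
    lo = groups.min(axis=1, keepdims=True)
    hi = groups.max(axis=1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / 3.0  # 4 levels span the group's range
    codes = np.clip(np.round((groups - lo) / scale), 0, 3).astype(np.uint8)
    return codes, scale, lo

def dequantize_int2(codes, scale, lo, shape):
    """Map 2-bit codes back to floats and restore the original shape."""
    return (codes.astype(np.float64) * scale + lo).reshape(shape)

rng = np.random.default_rng(0)
head_dim = 64  # hypothetical head dimension

# Synthetic, correlated "key" vectors; a real pipeline would dump activations.
mixing = rng.normal(size=(head_dim, head_dim))
calib_keys = rng.normal(size=(2048, head_dim)) @ mixing
R = fit_rotation(calib_keys)  # computed once, offline

keys = rng.normal(size=(256, head_dim)) @ mixing
rotated = keys @ R                                # rotate into the new basis
codes, scale, lo = quantize_int2(rotated)         # store 2-bit codes + metadata
rotated_hat = dequantize_int2(codes, scale, lo, rotated.shape)
keys_hat = rotated_hat @ R.T                      # rotate back at read time

rel_err = np.linalg.norm(keys - keys_hat) / np.linalg.norm(keys)
print(f"relative reconstruction error: {rel_err:.3f}")
```

Because `R` is orthogonal, the rotation itself loses nothing; all of the error comes from the 2-bit rounding step. Which basis makes that rounding least harmful to attention is the part the method's offline calibration is meant to address, and this sketch does not attempt to reproduce it.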

The precomputed rotation matrices are provided as drop-in .pt files, eliminating the need to re-run the Q/K/V dump and eigendecomposition. This makes it plug-and-play for researchers and practitioners who want to run medium-sized models (30-40B MoE or 10-20B dense) on consumer GPUs with 8GB VRAM. The release is generating buzz in the open-source community, with hopes to see it integrated into llama.cpp for broader accessibility.
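A rough back-of-the-envelope calculation shows why the figure is ~7x and not the nominal 8x that going from 16 bits to 2 bits would suggest. The sketch below assumes an FP16 baseline and a quantization layout with one 16-bit scale and one 16-bit offset per group of 128 values; the layout, the helper name, and the model dimensions are illustrative assumptions, not figures from the release.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bits_per_value, group_size=None, metadata_bits_per_group=0):
    """Approximate KV cache size in bytes for one sequence (keys plus values)."""
    values = 2 * num_layers * num_kv_heads * head_dim * seq_len  # 2 = K and V
    total_bits = values * bits_per_value
    if group_size is not None:
        # Each quantization group also stores its own metadata (scale, offset).
        total_bits += (values // group_size) * metadata_bits_per_group
    return total_bits // 8

# Hypothetical model shape, chosen only for illustration.
config = dict(num_layers=36, num_kv_heads=8, head_dim=128, seq_len=32768)

fp16_bytes = kv_cache_bytes(**config, bits_per_value=16)
int2_bytes = kv_cache_bytes(**config, bits_per_value=2,
                            group_size=128, metadata_bits_per_group=32)
ratio = fp16_bytes / int2_bytes

print(f"FP16 cache: {fp16_bytes / 2**30:.2f} GiB")
print(f"INT2 cache: {int2_bytes / 2**30:.2f} GiB")
print(f"compression: {ratio:.2f}x")
```

Under these assumptions each value costs 2 + 32/128 = 2.25 bits, so the ratio is 16 / 2.25, or about 7.1x. Smaller groups raise the metadata overhead and lower the ratio. The calculation covers only the KV cache; model weights still need their own memory.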

Key Points

Achieves ~7x KV cache compression using INT2 quantization with single-digit perplexity drop on GPQA.
Precomputed rotation matrices available for Qwen3-4B, 8B, 32B, and GLM-4.7B models.
Drop-in .pt files eliminate the need to re-run expensive eigendecomposition; community hopes for llama.cpp integration.

Why It Matters

Enables running 30-40B MoE or 10-20B dense models on consumer GPUs with 8GB VRAM, putting inference on larger models within reach of more users.
