Thin Keys, Full Values: Reducing KV Cache via Low-Dimensional Attention Selection
New paper shows transformer keys need far fewer dimensions than values, cutting KV cache memory by roughly 25GB per 128K-token user session.
A new research paper titled 'Thin Keys, Full Values: Reducing KV Cache via Low-Dimensional Attention Selection' challenges a fundamental assumption in transformer architecture. Authored by Hengshuai Yao and Guan Wang, the work argues that the standard practice of using identical dimensionality for queries, keys, and values is inefficient. The core insight is that keys and queries serve a low-dimensional 'selection' role—distinguishing which past tokens to attend to—while values carry the rich semantic information. This means keys can be drastically compressed without significantly harming model performance, a hypothesis validated across seven experiments including language modeling on WikiText and compression of models like GPT-2 and LLaMA.
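To make the asymmetry concrete, here is a minimal PyTorch sketch of a single attention head built in this style: queries and keys are projected into a small "selection" dimension r while values keep the full model width. The module name, the choice of r, and the omission of causal masking are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class ThinKeyAttention(nn.Module):
    """Single-head attention where Q/K live in a small selection
    subspace of dimension r << d_model, while V keeps full width.
    Hypothetical sketch; names and r are not from the paper."""
    def __init__(self, d_model: int, r: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, r, bias=False)        # thin query
        self.k_proj = nn.Linear(d_model, r, bias=False)        # thin key (this is what gets cached)
        self.v_proj = nn.Linear(d_model, d_model, bias=False)  # full-width value
        self.scale = r ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        # Selection happens in the low-dimensional space...
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        # ...but the content mixed into the output stays full-dimensional.
        return attn @ v

x = torch.randn(1, 16, 768)                    # (batch, seq, d_model)
out = ThinKeyAttention(d_model=768, r=64)(x)
print(out.shape)                               # torch.Size([1, 16, 768])
```

Because only keys (and queries) are shrunk, the cached key tensor is r/d_model the size of a standard one, while values and outputs are untouched.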
The most impactful finding is a practical method for existing models: applying Singular Value Decomposition (SVD) compression to the key matrices, followed by just 3 epochs of fine-tuning on a small dataset. On the 7.2B-parameter Mistral-7B model, this achieved a 75% reduction in the key-value (KV) cache memory footprint with only a 2.0% increase in perplexity (a measure of quality loss). For a model serving a 128,000-token context, this translates to saving approximately 25GB of GPU memory per user session. This efficiency gain could enable around 60% more concurrent users on the same hardware, directly addressing a major bottleneck and cost driver in deploying large language models at scale.
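The SVD step can be sketched as a plain truncated factorization of a pretrained key projection matrix, after which only the thin factor's output needs to be cached. The function name, shapes, and rank below are hypothetical; the paper's exact compression and fine-tuning recipe may differ.

```python
import torch

def compress_key_weight(w_k: torch.Tensor, rank: int):
    """Factor a pretrained key projection W_k (d_out x d_in) into two
    thin matrices via truncated SVD. The (r x d_in) factor produces the
    low-dimensional keys to cache; the (d_out x r) factor can be folded
    into the query path so attention scores are unchanged up to the
    truncation error. Illustrative sketch, not the paper's exact code."""
    u, s, vh = torch.linalg.svd(w_k, full_matrices=False)
    a = u[:, :rank] * s[:rank]   # (d_out, r): fold into the query side
    b = vh[:rank, :]             # (r, d_in): thin key projection to cache
    return a, b

w_k = torch.randn(4096, 4096)                 # stand-in for a real key matrix
a, b = compress_key_weight(w_k, rank=1024)    # keep 25% of the dimensions
err = torch.linalg.matrix_norm(w_k - a @ b) / torch.linalg.matrix_norm(w_k)
print(f"relative reconstruction error: {err:.3f}")
```

The brief fine-tuning pass then lets the query and key projections adapt to whatever the truncation discarded, which is what recovers most of the lost perplexity.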
- Challenges the symmetric Q/K/V design of transformers, showing keys need only O(log N) dimensions for selection versus the full d_model.
- Method: SVD compression plus lightweight QK fine-tuning saves 75% of KV cache memory with a ~2% perplexity increase on Mistral-7B.
- Enables ~60% more concurrent users by freeing ~25GB of GPU memory per 128K-context session (see the cache-size sketch after this list).
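As a back-of-the-envelope aid, the sketch below computes KV cache size from a model's configuration. Every parameter value is illustrative: the exact savings behind the ~25GB figure depend on head count, precision, batch size, and whether keys alone or the whole cache are compressed, none of which the summary above pins down.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Total K+V cache for one sequence: 2 tensors (K and V) per layer,
    each of shape (n_kv_heads, seq_len, head_dim). Plug in the real
    model's config; the defaults assume fp16 storage."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical full-attention config at a 128K-token context (fp16).
full = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=128_000)
thin = int(full * 0.25)          # the paper reports a 75% cache reduction
print(f"full: {full / 2**30:.1f} GiB, after compression: {thin / 2**30:.1f} GiB")
```

Whatever the exact configuration, the savings scale linearly with context length, which is why the technique matters most for long-context serving.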
Why It Matters
Dramatically reduces the memory and cost of serving long-context LLMs, a critical barrier for real-world deployment.