Research & Papers

Researchers show transformers can drop one of three QKV projections with minimal quality loss

Sharing key-value projections cuts KV cache by 50% with only 3.1% perplexity degradation.

Deep Dive

The paper "Do Transformers Need Three Projections? Systematic Study of QKV Variants" (accepted at ICML 2026) questions whether attention really needs separate query, key, and value projections. The authors test three sharing constraints: Q-K=V (keys and values share one projection), Q=K-V (queries and keys share one projection), and Q=K=V (a single projection for all three). Across synthetic tasks, vision benchmarks (MNIST, CIFAR, TinyImageNet, anomaly detection), and language modeling with up to 1.2B parameters on 10B tokens, they report that the Q-K=V variant matches or occasionally surpasses the standard QKV transformer.
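
To make the Q-K=V constraint concrete, here is a minimal single-head sketch in NumPy. It is an illustration, not the authors' implementation: the function name shared_kv_attention, the weight names w_q and w_kv, the causal mask, and the toy dimensions are all assumptions chosen for the example. The point it shows is that one projected tensor plays the role of both keys and values, so it is the only per-token tensor that would need to be cached.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def shared_kv_attention(x, w_q, w_kv, causal=True):
    """Single-head attention where one projection serves as both key and value.

    Hypothetical helper for illustration; names and shapes are assumptions.
    """
    q = x @ w_q    # (seq, d_head)
    kv = x @ w_kv  # (seq, d_head), used as keys AND values
    scores = q @ kv.T / np.sqrt(q.shape[-1])
    if causal:
        seq = x.shape[0]
        # Mask out positions to the right of each query token.
        future = np.triu(np.ones((seq, seq), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    weights = softmax(scores)
    # Return the output and the single tensor a decoder would cache.
    return weights @ kv, kv

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 16, 8  # toy sizes, not from the paper
x = rng.normal(size=(seq_len, d_model))
w_q = rng.normal(size=(d_model, d_head)) / np.sqrt(d_model)
w_kv = rng.normal(size=(d_model, d_head)) / np.sqrt(d_model)

out, cache = shared_kv_attention(x, w_q, w_kv)
print(out.shape, cache.shape)
```

A standard layer would compute separate k and v tensors and cache both; here the cache holds one tensor of the same shape, which is where the halving comes from.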

In language modeling, Q-K=V halves the KV cache at a cost of 3.1% higher perplexity. Because it shares projections rather than heads, it is complementary to head-sharing techniques such as grouped-query attention (GQA) and multi-query attention (MQA). Combining Q-K=V with GQA-4 yields an 87.5% cache reduction, and combining it with MQA yields a 96.9% reduction. The authors attribute Q-K=V's success to keys and values occupying similar representational spaces and to attention operating in a low-rank regime, whereas Q=K-V breaks attention directionality. The work offers a practical, quantifiable path to deploying large language models on edge devices with a much smaller memory footprint.
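
The reported percentages follow from multiplying two independent savings, which the short calculation below reproduces. The assumptions are not stated in this summary and are labeled as such: a 16-head baseline (inferred because 1 - 0.5/16 rounds to 96.9%), GQA-4 read as a 4x reduction in key-value heads, and a baseline that caches one key and one value tensor per head. The function names are hypothetical.

```python
def kv_cache_fraction(num_query_heads, num_kv_heads, shared_kv):
    """Fraction of the baseline multi-head KV cache that remains."""
    head_factor = num_kv_heads / num_query_heads  # saving from head sharing
    projection_factor = 0.5 if shared_kv else 1.0  # one cached tensor instead of two
    return head_factor * projection_factor

def reduction_percent(num_query_heads, num_kv_heads, shared_kv):
    return 100.0 * (1.0 - kv_cache_fraction(num_query_heads, num_kv_heads, shared_kv))

HEADS = 16  # assumed head count, inferred from the 96.9% figure

qkv_shared_only = reduction_percent(HEADS, HEADS, shared_kv=True)     # 50.0
with_gqa4 = reduction_percent(HEADS, HEADS // 4, shared_kv=True)      # 87.5
with_mqa = reduction_percent(HEADS, 1, shared_kv=True)                # 96.875

print(qkv_shared_only, with_gqa4, round(with_mqa, 1))
```

Under these assumptions the GQA-4 figure does not depend on the head count, while the MQA figure does: with 32 heads it would be about 98.4% rather than 96.9%.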

Key Points
  • Sharing key-value projections (Q-K=V) performs on par with or better than full QKV across vision and language tasks.
  • On a 1.2B-parameter language model, Q-K=V cuts the KV cache by 50% with only 3.1% perplexity degradation.
  • Combining Q-K=V with MQA achieves a 96.9% cache reduction, which could make on-device inference more practical.

Why It Matters

This finding could substantially reduce the memory and latency cost of deploying large language models on edge devices.