Open Source

The exact KV cache usage of DeepSeek V4

DeepSeek V4's KV cache drops to 9.62 GiB at 1M tokens, nearly 9x smaller than V3.2.

Deep Dive

DeepSeek V4's latest paper reveals a groundbreaking reduction in KV cache usage, a critical bottleneck for long-context AI models. By introducing Cross-Layer Attention (CLA) and Hybrid Cross-Attention (HCA), the architecture compresses key-value storage dramatically. For the Pro variant (1600B parameters), the KV cache drops to 9.62 GiB at 1 million tokens, compared to 83.88 GiB for V3.2 (671B). That is a roughly 8.7x improvement, not the 9.5x initially claimed, but still transformative. The Flash variant (284B) uses just 6.72 GiB, enabling 1M context on a single 256 GB RAM system with an RTX 3090.
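As a sanity check, the reduction factors follow directly from the cache sizes quoted above. A minimal sketch in Python; the per-token costs are derived from the reported totals, not figures from the paper:

```python
# Sanity-check the reported KV cache figures at 1M-token context.
GIB = 2**30
CONTEXT = 1_000_000  # tokens

kv_cache_gib = {
    "V3.2 (671B)":     83.88,
    "V4 Pro (1600B)":   9.62,
    "V4 Flash (284B)":  6.72,
}

baseline = kv_cache_gib["V3.2 (671B)"]
for model, gib in kv_cache_gib.items():
    per_token_kib = gib * GIB / CONTEXT / 1024  # derived cost per cached token
    print(f"{model}: {gib:5.2f} GiB, "
          f"{per_token_kib:5.1f} KiB/token, "
          f"{baseline / gib:4.1f}x smaller than V3.2")
```

Running this reproduces the headline ratios: roughly 8.7x for Pro and about 12.5x for Flash relative to V3.2.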

This efficiency stems from two architectural innovations: CLA processes 4 tokens at once and compresses them into 1 cache entry, while HCA uses an extreme 128:1 compression ratio. The result is a KV% (KV cache size as a percentage of model weight size; the reported figures imply 16-bit weights) of 0.30% for Pro and 1.18% for Flash, versus 6.25% for V3.2, a roughly 20x gain that outpaces current transformer-SSM hybrids. Chinese AI labs Kimi and Zhipu are expected to build derivatives, and llama.cpp support could soon bring 1M context to consumer hardware.
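The KV% figures can be reproduced from the headline numbers. A minimal sketch, under our assumption (consistent with, but not stated in, the article) that KV% compares KV cache size against weights at 2 bytes (16 bits) per parameter, with both sizes in the same gigabyte unit so the unit cancels:

```python
# Reproduce the KV% metric: KV cache size / model weight size.
# Assumption: 2 bytes per parameter (16-bit weights); both sizes measured
# in the same gigabyte unit, so the choice of GB vs GiB cancels out.
BYTES_PER_PARAM = 2

models = {
    #  name      (params in billions, KV cache @ 1M tokens in GiB)
    "V3.2":      (671,  83.88),
    "V4 Pro":    (1600,  9.62),
    "V4 Flash":  (284,   6.72),
}

for name, (params_b, kv_gib) in models.items():
    weight_gib = params_b * BYTES_PER_PARAM  # weight footprint at 16-bit
    print(f"{name}: KV% = {kv_gib / weight_gib * 100:.2f}%")
# Prints 6.25% (V3.2), 0.30% (V4 Pro), 1.18% (V4 Flash), matching the article.
```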

Key Points
  • DeepSeek V4 Pro uses 9.62 GiB KV cache at 1M context, down from 83.88 GiB in V3.2 (a roughly 8.7x reduction).
  • Flash variant (284B params) needs only 6.72 GiB, enabling 1M context on 256 GB RAM + RTX 3090 (rough memory budget sketched after this list).
  • KV% metric drops from 6.25% (V3.2) to 0.3%, a roughly 20x efficiency gain that outpaces current hybrid SSM models.
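To see why 256 GB of RAM is plausible for the Flash variant, here is a rough memory budget; the 4-bit quantization is our assumption (typical for llama.cpp-style local inference), not a figure from the paper:

```python
# Rough memory budget: V4 Flash at 1M-token context on a 256 GB machine.
PARAMS = 284e9          # V4 Flash parameter count
BITS_PER_PARAM = 4      # assumed quantization, not stated in the article
KV_CACHE_GIB = 6.72     # reported KV cache at 1M tokens

weight_gib = PARAMS * BITS_PER_PARAM / 8 / 2**30
total_gib = weight_gib + KV_CACHE_GIB
print(f"weights ≈ {weight_gib:.0f} GiB + KV cache {KV_CACHE_GIB:.2f} GiB "
      f"≈ {total_gib:.0f} GiB, leaving headroom in 256 GB RAM")
```

At roughly 132 GiB for quantized weights plus under 7 GiB of cache, the total leaves room for activations and the OS, which is what makes the single-box claim credible.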

Why It Matters

A roughly 8.7x KV cache reduction means 1M context on affordable hardware, unlocking long-document analysis for professionals.