Open Source

DeepSeek-V4: a million-token context that agents can actually use

Two MoE checkpoints with 1M context and roughly 2% of standard KV cache memory usage.

Deep Dive

DeepSeek released V4 today, a major update focused on efficient long-context inference for agentic workloads. The release includes two MoE checkpoints: DeepSeek-V4-Pro with 1.6T total parameters (49B active) and DeepSeek-V4-Flash with 284B total (13B active), both supporting a 1M-token context window. Benchmark numbers are competitive but not SOTA; the real innovation is the architectural optimizations that make long-context agent tasks practical. At 1M tokens, V4-Pro requires only 27% of the single-token inference FLOPs and 10% of the KV cache memory of V3.2. V4-Flash drops these further to 10% FLOPs and 7% KV cache. Against standard grouped-query attention with 8 heads in bfloat16, V4 uses roughly 2% of the cache size.
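For scale, here is a back-of-envelope sketch of where a number in the 2% range can come from. The baseline and compressed-entry sizes below (8 KV heads of dim 128 in bf16 versus a single hypothetical 512-dim FP8 entry per layer, alternating 4x and 128x compression) are illustrative assumptions, not disclosed V4 hyperparameters.

```python
# Back-of-envelope KV-cache comparison, not official numbers. Baseline: GQA with
# 8 KV heads of dim 128 stored in bf16. Hybrid: an assumed 512-dim compressed entry
# per layer in FP8, with layers alternating 4x (CSA) and 128x (HCA) compression.
# All sizes are per token per layer, amortized over a long context.
def gqa_bytes_per_token(kv_heads=8, head_dim=128, dtype_bytes=2):
    return 2 * kv_heads * head_dim * dtype_bytes          # K and V

def hybrid_bytes_per_token(entry_dim=512, dtype_bytes=1, ratios=(4, 128)):
    # Average over alternating CSA (4x) and HCA (128x) layers.
    return sum(entry_dim * dtype_bytes / r for r in ratios) / len(ratios)

baseline = gqa_bytes_per_token()    # 4096 bytes per token per layer
hybrid = hybrid_bytes_per_token()   # ~66 bytes per token per layer
print(f"hybrid / baseline = {hybrid / baseline:.1%}")  # ~1.6%, same ballpark as the ~2% claim
```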

The efficiency gains come from a hybrid attention mechanism that interleaves two approaches across layers. Compressed Sparse Attention (CSA) compresses KV entries by 4x using softmax-gated pooling with a learned positional bias, then applies a lightning indexer (a ReLU-scored multi-head dot product) to pick the top-k compressed blocks per query. Heavily Compressed Attention (HCA) compresses KV entries by 128x and runs dense attention over the compressed stream. Both mechanisms include a sliding-window branch for recent tokens. In V4-Pro's 61-layer stack, layers alternate between CSA and HCA, with FP8 storage for most KV entries, BF16 reserved for the RoPE dimensions, and the lightning indexer running in FP4. These design choices compound to produce the dramatic memory and compute savings. The paper also describes post-training optimizations that specifically target agent workflows, addressing issues like context budget overruns, KV caches that fill GPU memory, and tool-call round trips that degrade over long tasks.
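A minimal, single-head sketch of the two attention variants, assuming toy shapes and a random stand-in for the learned pooling gate, and omitting RoPE, the sliding-window branch, and the FP8/FP4 quantization. It illustrates the control flow described above, not the released implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gated_pool(kv, ratio):
    """Compress a [T, d] KV stream by `ratio` with softmax-gated pooling.
    The gate is a random stand-in for the learned gate + positional bias."""
    T, d = kv.shape
    T_c = T // ratio
    blocks = kv[: T_c * ratio].reshape(T_c, ratio, d)
    gate_w = np.random.randn(d)                      # stand-in for learned parameters
    gates = softmax(blocks @ gate_w, axis=1)         # [T_c, ratio] weights within each block
    return (gates[..., None] * blocks).sum(axis=1)   # [T_c, d] compressed entries

def csa_layer(q, k, v, ratio=4, top_k=8):
    """Compressed Sparse Attention: 4x-pooled KV, indexer picks top-k blocks per query."""
    k_c, v_c = gated_pool(k, ratio), gated_pool(v, ratio)
    idx_scores = np.maximum(q @ k_c.T, 0.0)                 # ReLU-scored indexer
    keep = np.argsort(-idx_scores, axis=-1)[:, :top_k]      # top-k compressed blocks per query
    out = np.zeros_like(q)
    for i, sel in enumerate(keep):                           # sparse attention over selected blocks
        w = softmax(q[i] @ k_c[sel].T / np.sqrt(q.shape[-1]))
        out[i] = w @ v_c[sel]
    return out

def hca_layer(q, k, v, ratio=128):
    """Heavily Compressed Attention: 128x-pooled KV, dense attention over what remains."""
    k_c, v_c = gated_pool(k, ratio), gated_pool(v, ratio)
    w = softmax(q @ k_c.T / np.sqrt(q.shape[-1]), axis=-1)
    return w @ v_c

# Schematic alternation between the two layer types, as in the described stack.
T, d = 1024, 64
q, k, v = (np.random.randn(T, d) for _ in range(3))
x = csa_layer(q, k, v)
x = hca_layer(x, k, v)
print(x.shape)  # (1024, 64)
```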

Key Points
  • Two MoE checkpoints: V4-Pro (1.6T total, 49B active) and V4-Flash (284B total, 13B active), both with 1M-token context window
  • Hybrid attention (CSA and HCA) reduces KV cache memory to roughly 2% of a standard GQA baseline and cuts inference FLOPs by 73-90%
  • Designed specifically for long-running agentic workloads like SWE-bench tasks and multi-step browsing sessions

Why It Matters

Enables practical deployment of million-token context agents without prohibitive GPU memory costs.