DeepSeek V4 paper full version is out, FP4 QAT details and stability tricks [D]
New techniques achieve 90% KV cache reduction with minimal quality loss.
DeepSeek's V4 paper dives deep into FP4 Quantization-Aware Training (QAT), a technique that quantizes MoE expert weights—the primary GPU memory consumers—directly to FP4 during late-stage training. The QK-path in the CSA indexer also uses FP4 activations, delivering a 2x speedup on QK selection while preserving 99.7% recall. Inference runs directly on these FP4 weights without dequantization, leading to dramatic efficiency gains: the V4-Pro model uses only 27% of V3.2's FLOPs and 10% of its KV cache at 1M context, while V4-Flash cuts that to 10% FLOPs and 7% KV cache.
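To make the QAT mechanics concrete, here is a minimal sketch of FP4 (E2M1) fake-quantization with a straight-through estimator, the standard way weights are trained to tolerate a low-precision grid. The E2M1 value set, per-block scaling, and 128-element block size are illustrative assumptions, not details confirmed by the paper.

```python
# Minimal sketch of FP4 (E2M1) fake-quantization with a straight-through
# estimator for QAT. Grid values, per-block scaling, and block size are
# illustrative assumptions, not details from the paper.
import torch

# Representable FP4 (E2M1) magnitudes: 0, 0.5, 1, 1.5, 2, 3, 4, 6
FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fp4_fake_quant(w: torch.Tensor, block: int = 128) -> torch.Tensor:
    """Snap weights to the FP4 grid per block; gradients pass through unchanged.

    Assumes w.numel() is divisible by `block`.
    """
    grid = FP4_GRID.to(w.device)
    wb = w.reshape(-1, block)                              # (num_blocks, block)
    scale = (wb.abs().amax(dim=1, keepdim=True) / 6.0).clamp(min=1e-8)
    normed = wb / scale                                    # values now within [-6, 6]
    # Nearest FP4 magnitude, sign preserved
    idx = (normed.abs().unsqueeze(-1) - grid).abs().argmin(dim=-1)
    wq = (torch.sign(normed) * grid[idx] * scale).reshape(w.shape)
    # Straight-through estimator: forward uses wq, backward treats it as identity
    return w + (wq - w).detach()

# During late-stage QAT, expert forward passes would use the fake-quantized
# weights so the optimizer adapts them to the FP4 grid:
w = torch.randn(4096, 512, requires_grad=True)
y = fp4_fake_quant(w)
```

Because the weights are trained against the FP4 grid, inference can store and use the quantized values directly, which is what removes the dequantization step mentioned above.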
Training stability for trillion-parameter MoE models is addressed with two novel mechanisms (both sketched below). Anticipatory routing deliberately desyncs the main model and router updates, routing with cached older parameters during loss spikes; this breaks the feedback loop that would otherwise amplify anomalies and costs about 20% overhead, incurred only when needed. SwiGLU clamping imposes hard limits on the linear path (clamped to [-10, 10]) and the gate path (capped at 10) to suppress extreme-value cascades.

DeepSeek also introduces a generative reward model: the same model both generates and evaluates outputs, trained on scored data with accompanying reasoning, which reduces the need for human labeling. Human evaluations show V4-Pro achieving a 62.7% win rate vs Gemini 3.1 Pro (77.5% on writing quality) and a 63% non-loss rate vs Opus 4.6 Max on white-collar tasks. 52% of users already consider V4-Pro their default coding model.
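Here is a rough sketch of the anticipatory-routing idea described above: keep a lagged copy of the router's parameters and fall back to it whenever a loss spike is detected, so the anomaly cannot feed back into expert assignment. The spike test, cache-refresh policy, and class/method names are assumptions for illustration, not DeepSeek's implementation.

```python
# Rough sketch of anticipatory routing (hypothetical; the spike test, refresh
# policy, and names are assumptions, not DeepSeek's implementation).
import copy
import torch
import torch.nn as nn

class LaggedRouter(nn.Module):
    """MoE router that falls back to cached, older parameters during loss spikes."""

    def __init__(self, d_model: int, n_experts: int, spike_factor: float = 2.0):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.cached = copy.deepcopy(self.router)        # desynced, older copy
        for p in self.cached.parameters():
            p.requires_grad_(False)
        self.spike_factor = spike_factor
        self.loss_ema = None                            # running loss estimate

    def refresh_cache(self) -> None:
        # Call periodically during healthy steps so the cache does not go stale.
        self.cached.load_state_dict(self.router.state_dict())

    def forward(self, x: torch.Tensor, step_loss: float) -> torch.Tensor:
        if self.loss_ema is None:
            self.loss_ema = step_loss
        spike = step_loss > self.spike_factor * self.loss_ema
        self.loss_ema = 0.99 * self.loss_ema + 0.01 * step_loss
        # During a spike, route with the older cached parameters so the anomaly
        # cannot feed back into expert assignment and amplify itself.
        logits = self.cached(x) if spike else self.router(x)
        return torch.softmax(logits, dim=-1)
```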
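And a minimal sketch of a SwiGLU block with the described clamps. The summary does not specify whether the clamps sit before or after the activation, so the placement, module name, and dimensions here are assumptions.

```python
# Minimal sketch of SwiGLU with the clamps described above (hypothetical module;
# clamp placement, names, and dimensions are assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClampedSwiGLU(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)   # gate path
        self.w_lin = nn.Linear(d_model, d_ff, bias=False)     # linear path
        self.w_out = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate = F.silu(self.w_gate(x)).clamp(max=10.0)   # gate path capped at 10
        lin = self.w_lin(x).clamp(-10.0, 10.0)          # linear path limited to [-10, 10]
        return self.w_out(gate * lin)
```

With this placement the product `gate * lin` stays bounded in magnitude, which is consistent with the stated goal of suppressing extreme-value cascades before they reach the output projection.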
- FP4 QAT reduces MoE expert weights and QK activations to FP4, enabling 2x speedup on QK selection with 99.7% recall
- Anticipatory routing and SwiGLU clamping prevent loss spikes during trillion-parameter MoE training, with only 20% overhead during anomalies
- Generative reward model eliminates separate RLHF models; V4-Pro achieves 62.7% win rate vs Gemini 3.1 Pro on Chinese writing
Why It Matters
FP4 at scale could cut inference costs by 90%, making multi-agent workflows and large-context deployments drastically cheaper.