DeepSeek V4 Preview Released with Significant Inference Cost Savings and Huawei Chip Support
New 1.6T-parameter model cuts KV cache memory by up to 13.7x
DeepSeek has launched V4 in preview, a new open-weight LLM family that rivals top proprietary US models while dramatically cutting inference costs and expanding hardware support. The release includes two MoE models: a 284B-parameter Flash variant (13B active) and a 1.6T-parameter Pro version (49B active). V4-Pro was trained on 33 trillion tokens and, per DeepSeek's benchmarks, beats all open-weight models and competes with the best proprietary Western LLMs, though real-world performance may vary.
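To put those figures in perspective, the quick calculation below derives each variant's active-parameter fraction, i.e. the share of weights actually exercised per token in an MoE model. It is plain arithmetic on the parameter counts quoted above, not additional disclosed detail.

```python
# Back-of-the-envelope sparsity math using only the parameter counts
# quoted in the announcement; nothing here is an extra disclosed detail.

variants = {
    "V4-Flash": {"total": 284e9, "active": 13e9},
    "V4-Pro": {"total": 1.6e12, "active": 49e9},
}

for name, p in variants.items():
    frac = p["active"] / p["total"]
    # Per-token compute in an MoE model scales with active parameters.
    print(f"{name}: {frac:.1%} of weights active per token "
          f"(~{p['total'] / p['active']:.0f}x parameter sparsity)")
```

That roughly 3-5% active fraction is why a 1.6T-parameter MoE can have per-token compute closer to that of a ~49B dense model.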
Architecturally, V4 introduces a hybrid attention mechanism combining Compressed Sparse Attention and Heavy Compressed Attention, reducing KV cache memory by 9.5x-13.7x versus V3.2 and enabling a 1M-token context window with far less infrastructure. The models also use FP8/FP4 mixed precision via quantization-aware training, halving memory needs relative to pure FP8. The Muon optimizer speeds training convergence and improves stability. Crucially, V4 is validated on both Nvidia GPUs and Huawei's Ascend NPUs, reducing reliance on US hardware. The smaller Flash model further lowers serving costs, making frontier AI more accessible.
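For a sense of scale, here is a minimal sketch of what a 9.5x-13.7x KV cache reduction means at a 1M-token context. Only the reduction factors and the context length come from the announcement; the layer count, per-layer KV width, and FP8 storage below are placeholder assumptions, since DeepSeek has not published V4's internals.

```python
# Illustrative KV-cache sizing, NOT DeepSeek's published numbers: layer
# count, KV width, and dtype are assumed values chosen only to show the
# effect of a 9.5x-13.7x cache reduction at a 1M-token context.

seq_len = 1_000_000       # 1M-token context window (from the announcement)
n_layers = 60             # assumed transformer depth
kv_dim = 1024             # assumed per-layer K/V width
bytes_per_elem = 1        # assumed FP8 storage, 1 byte per element

# Baseline cache: one K and one V entry per layer, per token.
baseline_bytes = 2 * n_layers * kv_dim * bytes_per_elem * seq_len

for reduction in (9.5, 13.7):
    print(f"{reduction:>4}x reduction: {baseline_bytes / 1e9:.0f} GB "
          f"-> {baseline_bytes / reduction / 1e9:.1f} GB per sequence")
```

Under these assumptions a single 1M-token sequence drops from roughly 123 GB of cache to about 9-13 GB, which is the difference between multi-node serving and a single accelerator.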
- Two MoE models: 284B-parameter Flash (13B active) and 1.6T-parameter Pro (49B active); Pro was trained on 33T tokens
- Hybrid attention cuts KV cache memory by 9.5x-13.7x vs V3.2, enabling a 1M token context window
- FP8/FP4 mixed precision via quantization-aware training halves memory vs pure FP8 (see the sketch below); validated on both Nvidia GPUs and Huawei Ascend NPUs
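The sketch below illustrates the general quantization-aware-training idea behind that last bullet: weights are "fake-quantized" in the forward pass while gradients flow through a straight-through estimator to full-precision master weights. It uses uniform signed 4-bit integer levels as a stand-in for FP4; DeepSeek's actual FP8/FP4 recipe is not public in this preview, so this is a generic illustration, not their method.

```python
import torch

# Generic quantization-aware-training sketch (NOT DeepSeek's recipe):
# fake-quantize weights in the forward pass, pass gradients straight
# through to the full-precision master copy. Uniform signed 4-bit
# integer levels stand in for FP4 here.

class FakeQuant4Bit(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        # Per-tensor scale mapping the largest magnitude to the int4 range.
        scale = x.abs().max().clamp(min=1e-8) / 7.0
        return (x / scale).round().clamp(-8, 7) * scale

    @staticmethod
    def backward(ctx, grad_out):
        # Straight-through estimator: treat rounding as the identity.
        return grad_out

# Full-precision master weights; the forward pass sees quantized values.
w = torch.randn(4, 4, requires_grad=True)
w_q = FakeQuant4Bit.apply(w)
loss = (w_q ** 2).sum()   # stand-in for a real training loss
loss.backward()           # gradient reaches the full-precision w
print(w.grad)
```

Training against the quantized forward pass is what lets the finished model tolerate low-precision weights at serving time, which is where the claimed memory halving shows up.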
Why It Matters
DeepSeek V4 slashes memory and hardware costs, making frontier AI viable on cheaper, non-Nvidia chips.