Deepseek V4 Flash and Non-Flash Out on HuggingFace
New open-weight models rival GPT-4, available now for download.
DeepSeek, the Chinese AI lab known for its open-weight models, has released DeepSeek V4 in Flash and Non-Flash variants on HuggingFace. The V4 Non-Flash model is a Mixture-of-Experts (MoE) transformer with 671B total parameters, of which only 37B are active per token. It achieves GPT-4-level performance on benchmarks such as MMLU (90.2%) and HumanEval (92.7%), with a 128K-token context window. The Flash variant uses 4-bit quantization to cut the memory footprint by 60%, enabling inference on consumer GPUs like the RTX 4090 at the cost of a slight accuracy drop (MMLU 88.5%). Both models are released under the Apache 2.0 license, which allows commercial use.
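For readers who want to try the Flash variant locally, here is a minimal sketch that loads it through Hugging Face transformers with a bitsandbytes 4-bit configuration. The repository ID is an assumption based on DeepSeek's usual naming, and whether an explicit quantization config is needed depends on how the checkpoint is packaged (it may already ship pre-quantized), so check the model card first.

```python
# Hedged sketch: load the Flash variant with 4-bit quantization via
# transformers + bitsandbytes. The repo ID below is an assumption, not
# a confirmed name from the release.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "deepseek-ai/DeepSeek-V4-Flash"  # hypothetical repository ID

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # match the Flash variant's 4-bit weights
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for speed
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=quant_config,
    device_map="auto",                      # spread/fit onto available GPU memory
    trust_remote_code=True,
)

prompt = "Write a Python function that reverses a string."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```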
For developers and researchers, this release democratizes access to frontier AI. The Non-Flash model is suited to complex reasoning tasks such as code generation or scientific analysis, while the Flash variant fits real-time applications like chatbots and document summarization. DeepSeek's decision to release these models as open weights challenges proprietary leaders like OpenAI, since users can fine-tune them locally without paying API costs. Early community tests show the Flash model running at 50 tokens/second on a single A100 GPU, making it viable for small teams. This release could accelerate AI adoption in resource-constrained environments, from startups to academic labs.
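Because the Apache 2.0 license permits local fine-tuning, a common low-cost approach is to attach LoRA adapters with the peft library rather than updating all weights. The sketch below is illustrative only: the repository ID and the attention projection module names are assumptions that depend on the published architecture.

```python
# Hedged sketch: attach LoRA adapters with peft so the open weights can be
# fine-tuned locally. Repo ID and target_modules are assumptions; adjust to
# the module names listed on the actual model card.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-V4-Flash",        # hypothetical repository ID
    device_map="auto",
    trust_remote_code=True,
)

lora_config = LoraConfig(
    r=16,                                   # low-rank adapter dimension
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],    # assumed attention projection names
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()          # only the adapter weights are trainable
# From here, pass `model` to a standard Trainer / SFT loop on your own data.
```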
- DeepSeek V4 Non-Flash: 671B params, 37B active per token, MMLU 90.2%.
- DeepSeek V4 Flash: 4-bit quantized, 60% less memory, runs on RTX 4090.
- Both models have 128K token context and Apache 2.0 license.
Why It Matters
Open-weight GPT-4 rival enables fine-tuning and local deployment, reducing AI costs for enterprises.