NVIDIA community runs DeepSeek V4 Flash on single DGX Spark with 1M context
DeepSeek V4 Flash achieves nearly 4 concurrent 1M-token requests on a $3K DGX Spark
The NVIDIA developer community has successfully run DeepSeek V4 Flash on a single DGX Spark (GX10/GB10) with full 1M context length, bringing frontier AI capabilities to consumer hardware. User j0n documented a vLLM-based setup configured with max model length 1,048,576 tokens and GPU memory utilization of 0.9. The system achieved a GPU KV cache of 4,093,302 tokens, enabling up to 3.9 concurrent requests at 1M tokens each. Performance metrics showed ~66 tokens per second overall (22 tps per request) with 3 concurrent sessions stable. However, adding a fourth concurrent request caused severe GPU utilization fluctuations due to decode being memory-bandwidth limited, as noted by community expert jasl9187.
Alternative approaches also gained traction: marco.palaferri reported a llama.cpp-based setup using the antirez/ds4 GGUF quantized build of DeepSeek-V4-Flash, run as a systemd service with an OpenAI-compatible endpoint for agentic workflows. He enabled a persistent KV cache directory to mitigate the bottleneck in long-context coding tasks. The community discussed hardware limitations: the Spark's 5070-grade design and constrained memory bandwidth (80 GB/s vs Apple Ultra-class SoCs) are the primary bottlenecks. Jasl9187 remains optimistic about further optimization but hopes for next-gen hardware with >256GB memory and RTX Pro 6000-class die sizes. This breakthrough demonstrates that frontier open-source reasoning models can now run locally for real-world agentic workloads, albeit with careful concurrency management.
- DeepSeek V4 Flash runs on a single DGX Spark with 1M context length using vLLM at 0.9 GPU memory utilization
- Achieves 3 concurrent requests at 22 tps each (66 tps total); 4th request causes performance fluctuation due to memory bandwidth limit
- Alternative llama.cpp setup with GGUF quantization also works, enabling persistent KV caching for agentic coding workflows
Why It Matters
Enables running frontier-level AI locally for coding agents without cloud dependency, democratizing long-context reasoning on $3K hardware