DeepSeek V4 Flash runs on a single DGX Spark with 1M context length using vLLM at 0.9 GPU memory utilization?

DeepSeek V4 Flash runs on a single DGX Spark with 1M context length using vLLM at 0.9 GPU memory utilization

Achieves 3 concurrent requests at 22 tps each (66 tps total); 4th request causes performance fluctuation due to memory bandwidth limit?

Achieves 3 concurrent requests at 22 tps each (66 tps total); 4th request causes performance fluctuation due to memory bandwidth limit

Alternative llama.cpp setup with GGUF quantization also works, enabling persistent KV caching for agentic coding workflows?

Alternative llama.cpp setup with GGUF quantization also works, enabling persistent KV caching for agentic coding workflows

Viral Wire

NVIDIA community runs DeepSeek V4 Flash on single DGX Spark with 1M context

NVIDIA Developer Forums May 17, 2026

⚡DeepSeek V4 Flash achieves nearly 4 concurrent 1M-token requests on a $3K DGX Spark

Deep Dive

The NVIDIA developer community has successfully run DeepSeek V4 Flash on a single DGX Spark (GX10/GB10) with full 1M context length, bringing frontier AI capabilities to consumer hardware. User j0n documented a vLLM-based setup configured with max model length 1,048,576 tokens and GPU memory utilization of 0.9. The system achieved a GPU KV cache of 4,093,302 tokens, enabling up to 3.9 concurrent requests at 1M tokens each. Performance metrics showed ~66 tokens per second overall (22 tps per request) with 3 concurrent sessions stable. However, adding a fourth concurrent request caused severe GPU utilization fluctuations due to decode being memory-bandwidth limited, as noted by community expert jasl9187.

Alternative approaches also gained traction: marco.palaferri reported a llama.cpp-based setup using the antirez/ds4 GGUF quantized build of DeepSeek-V4-Flash, run as a systemd service with an OpenAI-compatible endpoint for agentic workflows. He enabled a persistent KV cache directory to mitigate the bottleneck in long-context coding tasks. The community discussed hardware limitations: the Spark's 5070-grade design and constrained memory bandwidth (80 GB/s vs Apple Ultra-class SoCs) are the primary bottlenecks. Jasl9187 remains optimistic about further optimization but hopes for next-gen hardware with >256GB memory and RTX Pro 6000-class die sizes. This breakthrough demonstrates that frontier open-source reasoning models can now run locally for real-world agentic workloads, albeit with careful concurrency management.

Key Points

DeepSeek V4 Flash runs on a single DGX Spark with 1M context length using vLLM at 0.9 GPU memory utilization
Achieves 3 concurrent requests at 22 tps each (66 tps total); 4th request causes performance fluctuation due to memory bandwidth limit
Alternative llama.cpp setup with GGUF quantization also works, enabling persistent KV caching for agentic coding workflows

Why It Matters

Enables running frontier-level AI locally for coding agents without cloud dependency, democratizing long-context reasoning on $3K hardware

Read Original Article

NVIDIA community runs DeepSeek V4 Flash on single DGX Spark with 1M context

Why It Matters

Related Articles

🚀 Stay Ahead in AI