Viral Wire

DeepSeek-V4-Flash achieves 200K context on 2x DGX Spark at 44 tok/s

Confirmed end-to-end recipe for 200K context on dual-node DGX Spark hardware

Deep Dive

User tonyd615 has posted a verified recipe for running DeepSeek-V4-Flash (official FP8 weights, 149GB in 46 shards) across two NVIDIA DGX Spark (GB10) nodes at a 200K-token context length. Each node has 128GB of unified memory, and the two are connected by a direct QSFP56 200G cable carrying RoCE/NCCL traffic over CX-7. Reported warm decode is ~44 tok/s for a single stream, rising to ~96 tok/s aggregate at concurrency=8, with a time-to-first-token (TTFT) of ~2s on short prompts. MTP speculative decoding reaches a ~68% acceptance rate. The recipe relies on a pinned vLLM commit from the jasl/vllm fork and on eugr/spark-vllm-docker PR #219.
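The memory and throughput figures above imply a few simple budgets. What follows is a back-of-the-envelope sketch, not part of the posted recipe. It assumes the 149GB of weights are split evenly across the two nodes and ignores OS, runtime, and activation overhead; all variable names are illustrative.

```python
# Back-of-the-envelope budget from the figures quoted in the post.
# Assumptions (not from the recipe): weights split evenly across nodes,
# and no memory reserved for the OS, runtime, or activations.

NODES = 2
MEM_PER_NODE_GB = 128      # unified memory per DGX Spark
WEIGHTS_GB = 149           # official FP8 checkpoint size

total_mem_gb = NODES * MEM_PER_NODE_GB
headroom_gb = total_mem_gb - WEIGHTS_GB
weights_per_node_gb = WEIGHTS_GB / NODES
headroom_per_node_gb = MEM_PER_NODE_GB - weights_per_node_gb

SINGLE_STREAM_TOK_S = 44   # warm decode, one stream
AGGREGATE_TOK_S = 96       # warm decode, concurrency = 8
CONCURRENCY = 8

per_stream_tok_s = AGGREGATE_TOK_S / CONCURRENCY
aggregate_gain = AGGREGATE_TOK_S / SINGLE_STREAM_TOK_S

print(f"total memory:          {total_mem_gb} GB")
print(f"left after weights:    {headroom_gb} GB ({headroom_per_node_gb} GB per node)")
print(f"per-stream at c=8:     {per_stream_tok_s:.1f} tok/s")
print(f"aggregate vs 1 stream: {aggregate_gain:.2f}x")
```

Under these assumptions, roughly 107GB across both nodes (about 53GB each) remains for everything other than weights, which fits with the recipe's use of an FP8 KV cache and a small sequence limit. At concurrency 8 each stream sees about 12 tok/s, so batching raises aggregate throughput by roughly 2.2x at the cost of per-stream speed.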

The recipe treats three flags as mandatory: --kv-cache-dtype fp8, --enable-expert-parallel, and --max-model-len 200000, combined with max_num_seqs=2 to stay within the KV cache budget. Cold prefill is the weak spot (~53s at 32K and ~250s at 128K context), though optimizations from contributor jasl9187 recently yielded a ~20% improvement. Known gotchas: the NCCL commit must be pinned, Docker image transfer needs a workaround, and the CX-7 link occasionally wedges and requires a cold boot to recover. The result shows that long-context inference (200K tokens) is feasible on compact, comparatively affordable NVIDIA hardware, for local AI workloads that previously required large clusters.
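The flags named above can be collected into a launch command. This is a minimal sketch rather than the recipe's actual launch script. The model path is a placeholder, and passing max_num_seqs as the `--max-num-seqs` CLI option is an assumption about how the recipe sets it. Multi-node launch, parallelism sizes, and the MTP speculative decoding settings are omitted because the post summary does not give them, and the pinned fork may accept different options.

```python
import shlex

# Placeholder: the post summary does not state the model path or repo ID.
MODEL = "/path/to/DeepSeek-V4-Flash"


def build_vllm_args(model: str) -> list[str]:
    """Assemble a `vllm serve` argument list from the flags quoted in the post."""
    return [
        "vllm", "serve", model,
        "--kv-cache-dtype", "fp8",     # FP8 KV cache, listed as mandatory
        "--enable-expert-parallel",    # expert parallelism, listed as mandatory
        "--max-model-len", "200000",   # 200K-token context window
        "--max-num-seqs", "2",         # assumed CLI form of max_num_seqs=2
    ]


args = build_vllm_args(MODEL)
command = shlex.join(args)  # shell-safe string, for display or logging
print(command)
```

Keeping the arguments as a list makes it straightforward to hand them to `subprocess.run` on each node, or to print the joined string for a shell script.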

Key Points
  • Achieved 200K context on 2-node DGX Spark (128GB each) using DeepSeek-V4-Flash FP8
  • Warm decode performance: ~44 tok/s single stream, ~96 tok/s aggregate at concurrency 8
  • Cold prefill is slow (~53s at 32K, ~250s at 128K); recent optimizations yielded a ~20% improvement
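
The cold-prefill timings can be converted into effective token rates to show how prefill scales with context. This is a rough sketch that assumes "32K" and "128K" mean 32 × 1024 and 128 × 1024 tokens. The post summary does not say whether the timings were taken before or after the ~20% improvement.

```python
# Cold-prefill figures quoted in the post: context length (tokens) -> seconds.
# Assumption: "32K" and "128K" mean 32 * 1024 and 128 * 1024 tokens.
COLD_PREFILL_S = {32 * 1024: 53.0, 128 * 1024: 250.0}


def prefill_rate(tokens: int, seconds: float) -> float:
    """Effective prefill throughput in tokens per second."""
    return tokens / seconds


rates = {n: prefill_rate(n, s) for n, s in COLD_PREFILL_S.items()}
for n, r in rates.items():
    print(f"{n:>7} tokens: {r:.0f} tok/s")

# How much longer does 4x the context take?
token_ratio = (128 * 1024) / (32 * 1024)
time_ratio = COLD_PREFILL_S[128 * 1024] / COLD_PREFILL_S[32 * 1024]
print(f"{token_ratio:.0f}x tokens took {time_ratio:.2f}x as long")
```

On these numbers, four times the context takes about 4.7 times as long, so the effective prefill rate drops from roughly 620 tok/s to roughly 520 tok/s as the context grows. The ratio does not depend on how "K" is interpreted.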

Why It Matters

Shows that 200K-context inference fits on two compact DGX Spark nodes, so practitioners can run a large model locally rather than on a large cluster.