Open Source

DeepSeek-V4-Flash MTP quant hits 85 tok/s on dual RTX PRO 6000

New quant pack boosts decode speed by 62-110% using MTP self-speculation on workstation GPUs

Deep Dive

A community quant of DeepSeek-V4-Flash, DeepSeek-V4-Flash-Acti-MTP-W4A16-FP8, reaches 85.52 tokens per second at a 524k context length using MTP (Multi-Token Prediction) self-speculation, in which the model's own MTP head drafts tokens for the main model to verify. The quant, created by user LordNeel on top of pasta-paul's base W4A16-FP8 work, restores the MTP head that Hugging Face transformers silently strips during loading. It applies a GPTQ pass (Frantar-style with Cholesky H⁻¹) to the 768 routed-expert tensors (256 experts × {w1,w2,w3}), storing them as W4A16: INT4 weights, symmetric, group size 128, with 16-bit activations. Attention projections stay in FP8_BLOCK and shared components in BF16/FP32. The model runs on two RTX PRO 6000 Blackwell Max-Q cards (96 GB each, no NVLink) using a custom vLLM fork.
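To make the weight format concrete, here is a minimal sketch of symmetric INT4 quantization with a group size of 128. It uses plain round-to-nearest, not the GPTQ pass described above (GPTQ additionally compensates rounding error using Hessian information), and the function names and toy data are assumptions made for illustration only.

```python
import random

GROUP_SIZE = 128          # group=128, as stated for the routed-expert weights
QMIN, QMAX = -8, 7        # signed 4-bit integer range


def quantize_group(weights):
    """Symmetric INT4 quantization of one group: a single scale, no zero point."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / QMAX if max_abs > 0 else 1.0
    codes = [max(QMIN, min(QMAX, round(w / scale))) for w in weights]
    return codes, scale


def quantize_row(row, group_size=GROUP_SIZE):
    """Quantize a weight row group by group; returns (codes, per-group scales)."""
    codes, scales = [], []
    for start in range(0, len(row), group_size):
        group_codes, scale = quantize_group(row[start:start + group_size])
        codes.extend(group_codes)
        scales.append(scale)
    return codes, scales


def dequantize_row(codes, scales, group_size=GROUP_SIZE):
    """Rebuild approximate weights; in W4A16 the activations stay 16-bit."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]


random.seed(0)
row = [random.gauss(0.0, 0.02) for _ in range(512)]   # toy data: 4 groups of 128
codes, scales = quantize_row(row)
restored = dequantize_row(codes, scales)
max_error = max(abs(a - b) for a, b in zip(row, restored))

# Tensor count quoted in the article: 256 experts x {w1, w2, w3}
routed_expert_tensors = 256 * 3

print(f"groups: {len(scales)}, max abs error: {max_error:.6f}")
print(f"routed-expert tensors: {routed_expert_tensors}")
```

Each group of 128 weights shares one scale, so the rounding error per weight is bounded by half that group's scale. Smaller groups tighten the bound at the cost of storing more scales.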

In the reported benchmarks, decode speed rises from 52.85 tok/s without MTP to 85.52 tok/s at 524k context with 2-stream MTP (1.62× speedup), and to ~111 tok/s at 128k single-stream (2.10×). Time to first token (TTFT) increases slightly due to MTP overhead, but NCCL tuning brings it back down: setting NCCL_PROTO=LL, NCCL_ALGO=Ring, NCCL_MIN_NCHANNELS=8, and NCCL_NTHREADS=512 drops TTFT from ~155 ms to ~91 ms. A critical note for Max-Q cards: users must pass --disable-custom-all-reduce, because vLLM's CustomAllreduce uses CUDA P2P and deadlocks on a PCIe-only topology. The Server variant with NVLink avoids this issue. The model is available on Hugging Face and requires a patched vLLM fork to load.
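The NCCL settings and the all-reduce flag above can be collected into a launch helper. This is a sketch under assumptions, not the fork's documented invocation: the `vllm serve` entry point and `--tensor-parallel-size 2` are assumed from upstream vLLM and the two-card setup, `MODEL_ID` is a placeholder, `build_launch` is a hypothetical helper, and any MTP-specific options the patched fork needs are left out because the article does not list them.

```python
import os
import shlex

# NCCL settings quoted in the article for lowering TTFT.
NCCL_TUNING = {
    "NCCL_PROTO": "LL",
    "NCCL_ALGO": "Ring",
    "NCCL_MIN_NCHANNELS": "8",
    "NCCL_NTHREADS": "512",
}

# Placeholder: substitute the actual Hugging Face repo id or a local path.
MODEL_ID = "<model-repo-or-path>"


def build_launch(model_id, pcie_only=True):
    """Return (command, environment) for a two-GPU tensor-parallel server."""
    env = dict(os.environ)
    env.update(NCCL_TUNING)
    # Assumption: the fork keeps upstream vLLM's CLI and flag names.
    cmd = ["vllm", "serve", model_id, "--tensor-parallel-size", "2"]
    if pcie_only:
        # Needed on PCIe-only cards, where the custom all-reduce deadlocks.
        cmd.append("--disable-custom-all-reduce")
    return cmd, env


cmd, env = build_launch(MODEL_ID)
print(shlex.join(cmd))
# To start the server for real: subprocess.run(cmd, env=env, check=True)
```

Keeping the environment separate from the command makes it easy to compare TTFT with and without the NCCL overrides by launching once with `env` and once with a plain copy of `os.environ`.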

Key Points
  • Achieves 85.52 tok/s at 524k context using 2-stream MTP, a 62% speedup over the no-MTP baseline of 52.85 tok/s.
  • Requires a patched vLLM fork and --disable-custom-all-reduce on Max-Q cards to avoid deadlocks.
  • Quantization uses W4A16 INT4 for 768 expert tensors via GPTQ, FP8 for attention, and BF16/FP32 for shared layers.
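The headline ratios can be re-derived from the reported throughput figures. The short check below uses only numbers quoted in this article and treats 52.85 tok/s as the baseline for both context lengths, which is an assumption: it is the baseline that reproduces the stated 2.10× figure. The helper name is illustrative.

```python
BASELINE_TOK_S = 52.85     # decode speed without MTP
MTP_524K_TOK_S = 85.52     # 2-stream MTP at 524k context
MTP_128K_TOK_S = 111.0     # single-stream MTP at 128k context (approximate)
TTFT_BEFORE_MS = 155.0     # before NCCL tuning (approximate)
TTFT_AFTER_MS = 91.0       # after NCCL tuning (approximate)


def speedup(new_tok_s, base_tok_s):
    """Throughput ratio relative to the baseline."""
    return new_tok_s / base_tok_s


speedup_524k = speedup(MTP_524K_TOK_S, BASELINE_TOK_S)
speedup_128k = speedup(MTP_128K_TOK_S, BASELINE_TOK_S)
ttft_reduction = (TTFT_BEFORE_MS - TTFT_AFTER_MS) / TTFT_BEFORE_MS

print(f"524k: {speedup_524k:.2f}x ({(speedup_524k - 1) * 100:.0f}% faster)")
print(f"128k: {speedup_128k:.2f}x ({(speedup_128k - 1) * 100:.0f}% faster)")
print(f"TTFT reduction from NCCL tuning: {ttft_reduction * 100:.0f}%")
```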

Why It Matters

Shows that a large mixture-of-experts model can run at practical decode speeds and very long context on two workstation GPUs without NVLink, putting this class of inference within reach of a single workstation.