Open Source

DeepSeek V4 Pro runs locally on Epyc workstation with Q4 quantization

A Reddit user achieves 8.6 t/s generation on a single RTX PRO 6000 Max-Q GPU.

Deep Dive

A Reddit user (u/phm) shared a successful local deployment of DeepSeek V4 Pro, a large language model, running on a high-end workstation. The setup leveraged a community-modified version of llama.cpp with CUDA support (based on work by u/antirez) and the Q4_K_M quantization format to fit the model into available memory. The hardware included an AMD Epyc Genoa 9374F CPU, 12×96GB of system RAM, and a single NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition GPU with 97,247 MiB of VRAM. After loading the quantized model, the user reported a prompt processing speed of 12.2 tokens per second and a generation speed of 8.6 tokens per second. The memory breakdown showed approximately 92,472 MiB used on the GPU for the model itself, plus overhead for context and compute buffers.

While DeepSeek V4 Pro is not an official release from DeepSeek (the company's latest public model is DeepSeek-V3), the community has adopted the naming convention for what appears to be an improved variant. The model's response to a simple query identified itself as 'DeepSeek, an AI assistant created by the Chinese company DeepSeek' with a knowledge cutoff of May 2025 and a 1M token context window — specs consistent with DeepSeek-V3 but labeled as V4 in the Reddit post. The successful local inference on a single high-end GPU (albeit one with 97GB VRAM) showcases the growing capability of open-source tools to run frontier-level LLMs outside the cloud, though the hardware requirements remain steep for most individuals.

Key Points
  • Model: DeepSeek V4 Pro quantized to Q4_K_M, running on llama.cpp with CUDA support.
  • Hardware: Single RTX PRO 6000 Blackwell Max-Q (97 GB VRAM) on Epyc Genoa 9374F workstation.
  • Performance: 12.2 t/s prompt speed, 8.6 t/s generation, using ~92.5 GB GPU memory for the model.

Why It Matters

Local inference of large models is becoming practical on high-end workstations, reducing reliance on cloud APIs.