Custom Turing CUDA kernels optimize W8A8 INT8 matrix multiplication on RTX 2080 Ti Tensor Cores, reducing PCIe Gen3 bandwidth bottlenecks?

Custom Turing CUDA kernels optimize W8A8 INT8 matrix multiplication on RTX 2080 Ti Tensor Cores, reducing PCIe Gen3 bandwidth bottlenecks

Achieved 255 prefill tokens/s on a 284B total / 13B active MoE model using 4x 2080 Ti with heterogeneous VRAM/system RAM offloading?

Achieved 255 prefill tokens/s on a 284B total / 13B active MoE model using 4x 2080 Ti with heterogeneous VRAM/system RAM offloading

Budget build under $2,500?

Intel Xeon E5-2696 v4, 4x 2080 Ti (11/22GB), 1TB DDR4 ECC – all open-sourced on GitHub

Open Source

DeepSeek-V4 runs on 4x RTX 2080 Ti with 255 prefill tok/s

r/LocalLLaMA May 20, 2026

⚡Budget $2.5k setup beats H100 clusters with custom Turing kernels and W8A8 quantization

Deep Dive

A developer has successfully run DeepSeek-V4-Flash, a 284B-parameter mixture-of-experts (MoE) model with only 13B active parameters, on a single node of consumer legacy hardware costing under $2,500. The build uses four RTX 2080 Ti GPUs (11GB or 22GB VRAM each via modified BIOS), an Intel Xeon E5-2696 v4 CPU, and 1TB of DDR4 ECC RAM. The team achieved approximately 255 prefill tokens per second by overcoming major bottlenecks through hardware-software co-optimization.

Key technical breakthroughs include custom Turing CUDA kernels that accelerate W8A8 (INT8) quantized matrix multiplication on the 2080 Ti's Tensor Cores, significantly alleviating the PCIe Gen3 bandwidth choke. They also implemented heterogeneous inference with optimized static memory splitting and dynamic offloading between GPU VRAM and system RAM, ensuring 100% hardware utilization. A pipelined execution strategy hides the massive multi-GPU communication overhead caused by MoE routing. The entire implementation, deployment script, and preliminary tech report are open-sourced on GitHub (github.com/lvyufeng/deepseek-v4-2080ti).

Key Points

Custom Turing CUDA kernels optimize W8A8 INT8 matrix multiplication on RTX 2080 Ti Tensor Cores, reducing PCIe Gen3 bandwidth bottlenecks
Achieved 255 prefill tokens/s on a 284B total / 13B active MoE model using 4x 2080 Ti with heterogeneous VRAM/system RAM offloading
Budget build under $2,500: Intel Xeon E5-2696 v4, 4x 2080 Ti (11/22GB), 1TB DDR4 ECC – all open-sourced on GitHub

Why It Matters

Democratizes frontier model inference: a $2.5k consumer setup now rivals million-dollar H100 clusters for MoE models.

Read Original Article

DeepSeek-V4 runs on 4x RTX 2080 Ti with 255 prefill tok/s

Why It Matters

Related Articles

🚀 Stay Ahead in AI