DWDP: Distributed Weight Data Parallelism for High-Performance LLM Inference on NVL72
New parallelization strategy removes synchronization bottlenecks, letting GPUs work independently for faster AI responses.
A research team led by Wanqian Li has introduced DWDP (Distributed Weight Data Parallelism), a novel parallelization strategy designed to accelerate large language model inference on multi-GPU systems like NVIDIA's GB200 NVL72. The core innovation addresses a fundamental bottleneck: existing methods require frequent synchronization between GPUs at each layer, causing delays when workloads become imbalanced. DWDP instead adopts a data-parallel approach where model weights—specifically for Mixture of Experts (MoE) architectures—are distributed across peer GPUs. Each GPU can then progress independently, fetching the necessary expert weights on demand without waiting for others, which removes the synchronization penalty entirely.
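The on-demand scheme can be illustrated with a minimal sketch. Everything here is hypothetical (the class names, the round-robin placement, and the `get_expert` helper are invented for illustration, not TensorRT-LLM's actual API): each data-parallel rank stores only the experts it owns and pulls a missing expert's weights from the owning peer, with no collective operation involving the other ranks.

```python
# Hypothetical sketch of DWDP-style on-demand expert fetching.
# Placement, names, and caching policy are assumptions for illustration.

NUM_EXPERTS = 8
NUM_RANKS = 4

def owner_of(expert_id: int) -> int:
    """Static round-robin placement: which rank holds this expert's weights."""
    return expert_id % NUM_RANKS

class Rank:
    def __init__(self, rank_id: int):
        self.rank_id = rank_id
        # Local store starts with only the experts this rank owns.
        self.weight_cache = {e: f"weights[{e}]"
                             for e in range(NUM_EXPERTS) if owner_of(e) == rank_id}

    def get_expert(self, expert_id: int, peers: list["Rank"]) -> str:
        # Hit the local cache first; otherwise fetch from the owning peer.
        # Crucially, no rank waits on a layer-wise barrier to do this.
        if expert_id not in self.weight_cache:
            peer = peers[owner_of(expert_id)]
            self.weight_cache[expert_id] = peer.weight_cache[expert_id]
        return self.weight_cache[expert_id]

ranks = [Rank(r) for r in range(NUM_RANKS)]
# Rank 0's router picks experts 1 and 6; neither is local, so both are
# fetched on demand from their owners (ranks 1 and 2 respectively).
fetched = [ranks[0].get_expert(e, ranks) for e in (1, 6)]
print(fetched)  # ['weights[1]', 'weights[6]']
```

The point of the sketch is the control flow, not the data: a rank that needs a non-local expert resolves it point-to-point, so a slow rank never stalls its peers.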
To make this practical, the researchers implemented two key optimizations within the TensorRT-LLM framework. The first manages the split-weight distribution efficiently, while the second employs asynchronous prefetching to pull remote weights before they're needed, hiding transfer latency behind computation. In their evaluation using the 671-billion-parameter DeepSeek-R1 model on the NVL72 platform, DWDP delivered an 8.8% improvement in end-to-end tokens-per-second per GPU, while maintaining comparable per-user throughput across the critical 20-100 TPS/user serving range, using 8,000-token inputs and 1,000-token outputs.
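The prefetching idea itself is simple to sketch. The following is an illustrative toy, not the paper's implementation: while layer i computes, a background worker fetches the weights layer i+1 will need, so the remote-transfer latency overlaps with (and is hidden by) compute. The function names and timing constants are invented for the example.

```python
# Illustrative toy of asynchronous weight prefetching (hypothetical names).
# Real DWDP overlaps peer-GPU transfers with kernel execution; here we
# simulate both with sleeps to show the overlap structure.
import time
from concurrent.futures import ThreadPoolExecutor

FETCH_DELAY = 0.05    # stand-in for a remote weight transfer
COMPUTE_DELAY = 0.08  # stand-in for one layer's compute

def fetch_weights(layer: int) -> str:
    time.sleep(FETCH_DELAY)
    return f"weights(layer {layer})"

def compute_layer(layer: int, weights: str) -> str:
    time.sleep(COMPUTE_DELAY)
    return f"out(layer {layer})"

def run_pipeline(num_layers: int) -> list[str]:
    outputs = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch_weights, 0)  # warm up: prefetch layer 0
        for layer in range(num_layers):
            weights = pending.result()  # usually already done: latency hidden
            if layer + 1 < num_layers:
                # Kick off the next layer's fetch before computing this one.
                pending = pool.submit(fetch_weights, layer + 1)
            outputs.append(compute_layer(layer, weights))
    return outputs

outs = run_pipeline(4)
print(outs[-1])  # out(layer 3)
```

Because each fetch (0.05 s here) completes inside the following compute (0.08 s), the pipeline's cost approaches one fetch plus the sum of computes, rather than the fully serialized sum of fetches and computes.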
The technical report, published on arXiv, demonstrates that for modern, sparse MoE models, moving away from synchronized, layer-wise parallelism can yield significant performance dividends. This work points toward more efficient and scalable inference systems for the largest AI models, reducing the cost and latency of real-time AI applications. As models continue to grow, techniques like DWDP that minimize coordination overhead will become increasingly vital for deploying these systems in production environments.
Key Points
- Eliminates collective inter-rank synchronization, allowing GPUs to work independently and avoid slowdowns from workload imbalance.
- Implemented in TensorRT-LLM with optimizations for split-weight management and asynchronous remote-weight prefetching.
- Achieved 8.8% higher tokens-per-second per GPU running DeepSeek-R1 on NVIDIA GB200 NVL72 hardware with 8K/1K sequences.
Why It Matters
Lowers the cost and latency of serving massive AI models like DeepSeek-R1, making real-time, large-scale inference more feasible for businesses.