DWDP: Distributed Weight Data Parallelism for High-Performance LLM Inference on NVL72
New parallelization strategy removes synchronization bottlenecks, letting GPUs work independently for faster AI responses.
A research team led by Wanqian Li has introduced DWDP (Distributed Weight Data Parallelism), a novel parallelization strategy designed to accelerate large language model inference on multi-GPU systems like NVIDIA's GB200 NVL72. The core innovation addresses a fundamental bottleneck: existing methods require frequent synchronization between GPUs at each layer, causing delays when workloads become imbalanced. DWDP instead adopts a data-parallel approach where model weights—specifically for Mixture of Experts (MoE) architectures—are distributed across peer GPUs. Each GPU can then progress independently, fetching the necessary expert weights on demand without waiting for others, which removes the synchronization penalty entirely.
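The on-demand scheme can be illustrated with a minimal sketch. Everything here is hypothetical (the class names, the round-robin placement, and the `get_expert` helper are invented for illustration, not TensorRT-LLM's actual API): each data-parallel rank stores only the experts it owns and pulls a missing expert's weights from the owning peer, with no collective operation involving the other ranks.

```python
# Hypothetical sketch of DWDP-style on-demand expert fetching.
# Placement, names, and caching policy are assumptions for illustration.

NUM_EXPERTS = 8
NUM_RANKS = 4

def owner_of(expert_id: int) -> int:
    """Static round-robin placement: which rank holds this expert's weights."""
    return expert_id % NUM_RANKS

class Rank:
    def __init__(self, rank_id: int):
        self.rank_id = rank_id
        # Local store starts with only the experts this rank owns.
        self.weight_cache = {e: f"weights[{e}]"
                             for e in range(NUM_EXPERTS) if owner_of(e) == rank_id}

    def get_expert(self, expert_id: int, peers: list["Rank"]) -> str:
        # Hit the local cache first; otherwise fetch from the owning peer.
        # Crucially, no rank waits on a layer-wise barrier to do this.
        if expert_id not in self.weight_cache:
            peer = peers[owner_of(expert_id)]
            self.weight_cache[expert_id] = peer.weight_cache[expert_id]
        return self.weight_cache[expert_id]

ranks = [Rank(r) for r in range(NUM_RANKS)]
# Rank 0's router picks experts 1 and 6; neither is local, so both are
# fetched on demand from their owners (ranks 1 and 2 respectively).
fetched = [ranks[0].get_expert(e, ranks) for e in (1, 6)]
print(fetched)  # ['weights[1]', 'weights[6]']
```

The point of the sketch is the control flow, not the data: a rank that needs a non-local expert resolves it point-to-point, so a slow rank never stalls its peers.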
To make this practical, the researchers implemented two key optimizations within the TensorRT-LLM framework. The first manages the split-weight distribution efficiently, while the second employs asynchronous prefetching to pull remote weights before they're needed, hiding transfer latency behind computation. In their evaluation using the 671-billion-parameter DeepSeek-R1 model on the NVL72 platform, DWDP delivered an 8.8% improvement in end-to-end tokens-per-second per GPU, while maintaining comparable per-user throughput across the critical 20-100 TPS/user serving range, using 8,000-token inputs and 1,000-token outputs.
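The prefetching idea itself is simple to sketch. The following is an illustrative toy, not the paper's implementation: while layer i computes, a background worker fetches the weights layer i+1 will need, so the remote-transfer latency overlaps with (and is hidden by) compute. The function names and timing constants are invented for the example.

```python
# Illustrative toy of asynchronous weight prefetching (hypothetical names).
# Real DWDP overlaps peer-GPU transfers with kernel execution; here we
# simulate both with sleeps to show the overlap structure.
import time
from concurrent.futures import ThreadPoolExecutor

FETCH_DELAY = 0.05    # stand-in for a remote weight transfer
COMPUTE_DELAY = 0.08  # stand-in for one layer's compute

def fetch_weights(layer: int) -> str:
    time.sleep(FETCH_DELAY)
    return f"weights(layer {layer})"

def compute_layer(layer: int, weights: str) -> str:
    time.sleep(COMPUTE_DELAY)
    return f"out(layer {layer})"

def run_pipeline(num_layers: int) -> list[str]:
    outputs = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch_weights, 0)  # warm up: prefetch layer 0
        for layer in range(num_layers):
            weights = pending.result()  # usually already done: latency hidden
            if layer + 1 < num_layers:
                # Kick off the next layer's fetch before computing this one.
                pending = pool.submit(fetch_weights, layer + 1)
            outputs.append(compute_layer(layer, weights))
    return outputs

outs = run_pipeline(4)
print(outs[-1])  # out(layer 3)
```

Because each fetch (0.05 s here) completes inside the following compute (0.08 s), the pipeline's cost approaches one fetch plus the sum of computes, rather than the fully serialized sum of fetches and computes.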
The technical report, published on arXiv, demonstrates that for modern, sparse MoE models, moving away from synchronized, layer-wise parallelism can yield significant performance dividends. This work points toward more efficient and scalable inference systems for the largest AI models, reducing the cost and latency of real-time AI applications. As models continue to grow, techniques like DWDP that minimize coordination overhead will become increasingly vital for deploying these systems in production environments.
Key Points
- Eliminates collective inter-rank synchronization, allowing GPUs to work independently and avoid slowdowns from workload imbalance.
- Implemented in TensorRT-LLM with optimizations for split-weight management and asynchronous remote-weight prefetching.
- Achieved 8.8% higher tokens-per-second per GPU running DeepSeek-R1 on NVIDIA GB200 NVL72 hardware with 8K/1K sequences.
Why It Matters
Lowers the cost and latency of serving massive AI models like DeepSeek-R1, making real-time, large-scale inference more feasible for businesses.