Research & Papers

Communication Offloading on SmartNIC DPUs: A Quantitative Approach

New 'Buddy' engine on Nvidia BlueField-3 DPU speeds up host apps by up to 55%

Deep Dive

As CPU cores become increasingly expensive and power-hungry, researchers are turning to SmartNIC Data Processing Units (DPUs) to offload communication tasks that would otherwise consume host processor cycles. A new paper from Jacob Wahlgren and colleagues (KTH, LLNL, UC Davis) introduces 'Buddy' — a communication offloading engine that implements an asynchronous 'fire-and-forget' messaging model directly on programmable DPU cores. Buddy runs on both Nvidia BlueField-3 DPUs and generic x86 CPUs, and it decouples message routing from the application process entirely.
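To make the model concrete, here is a minimal sketch of the fire-and-forget pattern described above. The names and structure (`OffloadEngine`, `post`) are illustrative inventions, not the paper's actual API, and a plain worker thread stands in for the DPU cores: the application posts a message descriptor and returns immediately, while routing happens asynchronously on the offload side.

```python
import queue
import threading

class OffloadEngine:
    """Illustrative stand-in for an offload engine: drains posted messages
    on a background thread, decoupling routing from the application."""

    def __init__(self):
        self._q = queue.Queue()
        self.routed = []  # stand-in for "messages handed to the NIC"
        self._worker = threading.Thread(target=self._loop, daemon=True)
        self._worker.start()

    def post(self, dest, payload):
        # Fire-and-forget: enqueue the descriptor and return at once;
        # the caller never waits for the send to complete.
        self._q.put((dest, payload))

    def _loop(self):
        # In Buddy this loop would run on the DPU's Arm cores.
        while True:
            item = self._q.get()
            if item is None:          # shutdown sentinel
                break
            self.routed.append(item)  # a real engine would route here

    def shutdown(self):
        self._q.put(None)
        self._worker.join()

engine = OffloadEngine()
for i in range(1000):
    engine.post(dest=i % 8, payload=i)  # each call returns immediately
engine.shutdown()
print(f"routed {len(engine.routed)} messages")
```

The point of the pattern is visible in `post`: the host thread's cost is a single enqueue, regardless of how expensive the actual routing is on the offload side.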

Benchmarking Buddy across five applications revealed that host-dominated workloads benefit most: Quicksilver and Sparse Matrix Transpose saw speedups of up to 1.55x. However, the evaluation also uncovered a serious bottleneck: DPUs lack Direct Cache Access (DCA) support, causing DRAM traffic to surge by a staggering 625x compared to CPU-native execution. This tradeoff means Buddy's offloading wins are real but come at a memory bandwidth cost that future SmartNIC designs must address. The paper is set to appear at Euro-Par 2026 and provides a quantitative framework for engineers evaluating where, and when, DPU offloading makes sense.

Key Points
  • Buddy engine offloads asynchronous 'fire-and-forget' communication tasks from CPUs to DPUs like Nvidia BlueField-3.
  • Up to 1.55x speedup for host-dominated workloads (Quicksilver, Sparse Matrix Transpose) measured across five applications.
  • 625x increase in DRAM traffic on DPUs due to missing Direct Cache Access – a key design gap for next-gen SmartNICs.

Why It Matters

Buddy shows DPU offloading works for host-bound apps, but the 625x DRAM spike demands hardware fixes before widespread adoption.