Research & Papers

Accelerating Intra-Node GPU-to-GPU Communication Through Multi-Path Transfers with CUDA Graphs

Multi-path NVLink/PCIe transfer method cuts overhead in HPC clusters...

Deep Dive

A team of researchers from Queen's University (Amirhossein Sojoodi, Yiltan Hassan Temucin, Amirreza Baratisedeh, Hamed Sharifian, Ahmad Afsahi) has proposed a novel approach to accelerating intra-node GPU-to-GPU communication by integrating CUDA Graphs into the UCX framework. Their method uses multiple communication paths simultaneously, including NVLink (direct GPU-to-GPU) and PCIe through the host, to maximize aggregate bandwidth. By capturing a communication workflow into a CUDA graph and launching it as a single unit, they amortize kernel launch overhead and reduce CPU-GPU synchronization, improving execution efficiency. This is, to their knowledge, the first seamless integration of CUDA Graphs into UCX, a key middleware for high-performance computing (HPC) and MPI-based applications.
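
The multi-path idea can be illustrated with plain CUDA runtime calls. The sketch below is not the authors' UCX integration; the buffer names, the 64 MB message size, the naive 50/50 split between the direct peer-to-peer path (NVLink) and the host-staged path (PCIe), and the two-stream layout are illustrative assumptions. It shows how both copy paths can be captured into a single CUDA graph and then replayed with one launch.

```cpp
// Minimal sketch (not the authors' UCX code): split one GPU-to-GPU transfer
// across a direct P2P path (NVLink) and a host-staged path (PCIe), capture both
// into a CUDA graph, and replay the whole transfer with a single graph launch.
// Buffer names, the 50/50 split, and the message size are illustrative assumptions.
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

#define CHECK(call) do { cudaError_t e = (call); if (e != cudaSuccess) {          \
    std::fprintf(stderr, "%s:%d %s\n", __FILE__, __LINE__, cudaGetErrorString(e)); \
    std::exit(1); } } while (0)

int main() {
    const size_t bytes = 64 << 20;   // 64 MB total transfer (assumed message size)
    const size_t half  = bytes / 2;  // naive 50/50 split across the two paths

    // Source buffer on GPU 0, destination on GPU 1, pinned host staging buffer.
    // Assumes GPU 0 and GPU 1 support peer access (e.g. connected via NVLink).
    char *src, *dst, *stage;
    CHECK(cudaSetDevice(0));
    CHECK(cudaMalloc(&src, bytes));
    CHECK(cudaDeviceEnablePeerAccess(1, 0));
    CHECK(cudaSetDevice(1));
    CHECK(cudaMalloc(&dst, bytes));
    CHECK(cudaDeviceEnablePeerAccess(0, 0));
    CHECK(cudaSetDevice(0));
    CHECK(cudaMallocHost(&stage, half));     // pinned memory for the PCIe staging path

    cudaStream_t sP2P, sHost;
    cudaEvent_t fork, join;
    CHECK(cudaStreamCreate(&sP2P));
    CHECK(cudaStreamCreate(&sHost));
    CHECK(cudaEventCreate(&fork));
    CHECK(cudaEventCreate(&join));

    // Capture both paths into one graph so the whole transfer launches as a unit.
    cudaGraph_t graph;
    cudaGraphExec_t exec;
    CHECK(cudaStreamBeginCapture(sP2P, cudaStreamCaptureModeGlobal));
    CHECK(cudaEventRecord(fork, sP2P));
    CHECK(cudaStreamWaitEvent(sHost, fork, 0));   // fork capture onto the second stream

    // Path 1: direct GPU0 -> GPU1 peer copy (NVLink when peer access is enabled).
    CHECK(cudaMemcpyPeerAsync(dst, 1, src, 0, half, sP2P));

    // Path 2: GPU0 -> pinned host -> GPU1, i.e. staged through the host over PCIe.
    CHECK(cudaMemcpyAsync(stage, src + half, half, cudaMemcpyDeviceToHost, sHost));
    CHECK(cudaMemcpyAsync(dst + half, stage, half, cudaMemcpyHostToDevice, sHost));

    CHECK(cudaEventRecord(join, sHost));
    CHECK(cudaStreamWaitEvent(sP2P, join, 0));    // join the two paths
    CHECK(cudaStreamEndCapture(sP2P, &graph));

    CHECK(cudaGraphInstantiate(&exec, graph, 0)); // CUDA 12 signature
    CHECK(cudaGraphLaunch(exec, sP2P));           // one launch replays both paths
    CHECK(cudaStreamSynchronize(sP2P));
    std::printf("multi-path transfer of %zu bytes complete\n", bytes);
    return 0;
}
```

Capturing once and launching many times is what moves per-transfer launch and synchronization cost off the critical path; a production implementation such as the one described in the paper would additionally tune the split ratio and chunking rather than hard-coding 50/50.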

In experiments on a four-GPU NVIDIA node, the CUDA Graph-based multi-path approach achieved up to a 2.95x bandwidth improvement over the single-path UCX (UCT::CUDA-IPC) baseline in GPU-to-GPU OSU Micro-Benchmarks (OMB) bandwidth tests with message sizes up to 512 MB. The gain comes from overlapping transfers across the NVLink and host-PCIe paths and from reducing CPU-GPU synchronization overhead. This work directly addresses a critical bottleneck in modern HPC clusters, where GPU communication latency limits the scaling of distributed AI training and scientific simulations. The paper is available on arXiv (2604.22228); its DataCite registration is pending.
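
To see where bandwidth figures of this kind come from, a measurement loop in the spirit of the OSU bandwidth test can be wrapped around the captured graph. The snippet below is an illustrative sketch, not OMB itself; `exec`, `sP2P`, and `bytes` refer to the hypothetical graph, stream, and message size from the previous sketch, and the warm-up and iteration counts are arbitrary.

```cpp
// Illustrative bandwidth measurement around the captured graph (not OMB itself).
// `exec`, `sP2P`, and `bytes` come from the previous sketch.
const int warmup = 10, iters = 100;
for (int i = 0; i < warmup; ++i)
    CHECK(cudaGraphLaunch(exec, sP2P));        // warm-up launches

cudaEvent_t start, stop;
CHECK(cudaEventCreate(&start));
CHECK(cudaEventCreate(&stop));
CHECK(cudaEventRecord(start, sP2P));
for (int i = 0; i < iters; ++i)
    CHECK(cudaGraphLaunch(exec, sP2P));        // one launch per full multi-path transfer
CHECK(cudaEventRecord(stop, sP2P));
CHECK(cudaEventSynchronize(stop));

float ms = 0.0f;
CHECK(cudaEventElapsedTime(&ms, start, stop));
double gbps = (double)bytes * iters / (ms / 1e3) / 1e9;  // bytes/s -> GB/s
std::printf("achieved bandwidth: %.2f GB/s\n", gbps);
```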

Key Points
  • First integration of CUDA Graphs into UCX for multi-path GPU communication
  • Up to 2.95x bandwidth improvement over single-path UCX in GPU-to-GPU benchmarks
  • Concurrently uses NVLink and PCIe paths, tested on a four-GPU node with messages up to 512 MB

Why It Matters

Faster GPU communication directly accelerates distributed AI training and HPC workloads, reducing time-to-solution.