ucTrace: A Multi-Layer Profiling Tool for UCX-driven Communication

New open-source profiler exposes hidden bottlenecks in GPU and NIC communication for large-scale simulations.

Deep Dive

A research team from Koç University and Simula Research Laboratory has introduced ucTrace, a new open-source profiling tool designed to illuminate the often-opaque communication layer in high-performance computing (HPC). The tool specifically targets the Unified Communication X (UCX) framework, which is critical for enabling low-latency, high-bandwidth data transfer in multi-node CPU-GPU clusters and serves as the transport layer for many MPI implementations.

Existing profiling tools have struggled to provide visibility into the UCX layer, often lacking fine-grained traces or being limited to specific MPI libraries. ucTrace addresses this by profiling message passing directly at the UCX level. Its key innovation is the ability to correlate low-level communication operations—between hosts, GPUs, and network interface cards (NICs)—with the high-level MPI functions that initiated them. This multi-layer visibility is presented through interactive visualizations of process- and device-specific interactions.
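To make the idea of multi-layer correlation concrete, here is a minimal sketch of attributing low-level transport events to the MPI call that was active when they occurred. It is not ucTrace's implementation: the article does not describe how the tool performs the correlation, so timestamp containment is used here as a stand-in technique. The record types (`MpiCall`, `TransportEvent`), their fields, and the operation and device labels are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class MpiCall:
    """Hypothetical record of one high-level MPI call on a rank."""
    rank: int
    name: str
    start: float  # seconds
    end: float    # seconds


@dataclass
class TransportEvent:
    """Hypothetical record of one low-level transfer between devices."""
    rank: int
    op: str         # illustrative label, e.g. "put" or "get"
    src: str        # illustrative device name, e.g. "gpu0"
    dst: str        # illustrative device name, e.g. "nic0"
    nbytes: int
    timestamp: float


def attribute_events(
    calls: List[MpiCall], events: List[TransportEvent]
) -> List[Tuple[TransportEvent, Optional[str]]]:
    """Pair each event with the MPI call whose time window contains it.

    Assumption: an event belongs to the first call on the same rank with
    start <= timestamp <= end. Events outside every window map to None.
    """
    attributed = []
    for event in events:
        owner = None
        for call in calls:
            same_rank = call.rank == event.rank
            if same_rank and call.start <= event.timestamp <= call.end:
                owner = call.name
                break
        attributed.append((event, owner))
    return attributed


calls = [
    MpiCall(rank=0, name="MPI_Send", start=0.0, end=1.0),
    MpiCall(rank=0, name="MPI_Allreduce", start=2.0, end=3.0),
]
events = [
    TransportEvent(0, "put", "gpu0", "nic0", 4096, 0.5),
    TransportEvent(0, "get", "host", "nic0", 64, 2.5),
    TransportEvent(1, "put", "gpu0", "nic0", 128, 0.5),  # no call on rank 1
]
owners = [owner for _, owner in attribute_events(calls, events)]
```

In this toy input, the first two events fall inside the `MPI_Send` and `MPI_Allreduce` windows on rank 0, while the third has no matching call and stays unattributed. A real profiler would likely need something more robust than timestamps, since non-blocking operations can overlap in time.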

The researchers validated ucTrace through a suite of experiments, including analyzing MPI point-to-point behavior under different UCX configurations, comparing Allreduce operations across MPI libraries, and profiling a GPU-accelerated GROMACS molecular dynamics simulation at scale. By exposing transport-layer behavior and device interactions, ucTrace provides system administrators and application developers with the detailed insights needed to pinpoint bottlenecks, optimize performance, and debug complex communication patterns in large-scale scientific and AI workloads. The tool is publicly available and will be presented at the 40th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2026).

Key Points
  • Profiles message passing at the UCX layer, linking host, GPU, and NIC operations to the MPI functions that initiated them.
  • Validated on MPI point-to-point and Allreduce experiments and on a GPU-accelerated GROMACS molecular dynamics simulation at scale.
  • Open-source tool addresses a critical gap where existing profilers lack fine-grained UCX-level traces.

Why It Matters

By exposing transport-layer behavior and device interactions, ucTrace helps developers and system administrators find and fix communication bottlenecks in large-scale AI training and scientific simulations.