Research & Papers

CCL-D: A High-Precision Diagnostic System for Slow and Hang Anomalies in Large-Scale Model Training

A new tool pinpoints faulty GPU ranks in just 6 minutes on 4,000-GPU clusters.

Deep Dive

Slow/hang anomalies in collective communication libraries (CCLs) are among the most common and time-consuming issues in large-scale AI training, often taking hours or days to diagnose because of complex hardware-software interactions. Traditional debugging methods are inaccurate and inefficient. To address this, a team of 20 researchers proposed CCL-D, a diagnostic system that integrates a lightweight distributed tracing probe with an intelligent decision analyzer. The probe collects cross-layer anomaly metrics in real time at the rank level, while the analyzer automates detection and root-cause localization, precisely identifying the faulty GPU rank without manual intervention.
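The paper's implementation details aren't described here, but the rank-level idea can be illustrated with a toy sketch: a probe timestamps each rank's entry into a collective operation, and an analyzer flags the straggler whose entry lags the rest. The function name, the MAD-based threshold, and the trace format below are all hypothetical, not the paper's actual method.

```python
import statistics

def locate_straggler(enter_times, threshold=3.0):
    """Given {rank: timestamp at which that rank entered a collective},
    flag ranks whose entry lags the median by more than `threshold`
    median absolute deviations -- a crude stand-in for an automated
    decision analyzer. (Illustrative only.)"""
    med = statistics.median(enter_times.values())
    mad = statistics.median(abs(t - med) for t in enter_times.values()) or 1e-9
    return sorted(r for r, t in enter_times.items() if (t - med) / mad > threshold)

# Simulated trace: rank 2 enters the all-reduce ~500 ms late,
# stalling every other rank in the communicator.
trace = {0: 10.001, 1: 10.003, 2: 10.512, 3: 10.002}
print(locate_straggler(trace))  # -> [2]
```

In a real collective, every healthy rank blocks waiting for the straggler, so all ranks look slow from the outside; per-rank entry timestamps are what make the faulty rank distinguishable.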

Deployed on a production cluster of 4,000 GPUs over one year, CCL-D achieved near-complete coverage of known slow/hang anomalies. The system consistently pinpointed affected ranks within 6 minutes—dramatically outperforming existing solutions that could take hours or days. Accepted at PPoPP'26, CCL-D represents a major step forward for reliability in distributed AI training at scale, reducing costly downtime and enabling faster iteration cycles for large model development.

Key Points
  • CCL-D uses a lightweight distributed tracing framework to monitor cross-layer metrics across GPU ranks in real time.
  • Its intelligent decision analyzer automates detection and root-cause localization, identifying the exact faulty GPU rank.
  • On a 4,000-GPU cluster, CCL-D cut diagnosis time from hours/days to just 6 minutes with near-complete anomaly coverage.
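Hang detection can be sketched in the same toy style: if a rank's trace shows an "enter" event for a collective with no matching "exit" past a deadline, that rank is a hang candidate. The event format, names, and 300-second timeout below are assumptions for illustration, not CCL-D's actual mechanism.

```python
def find_hung_ranks(events, now, timeout_s=300.0):
    """Scan (rank, op_id, phase, timestamp) probe events; a collective
    entered but never exited before the deadline marks that rank as a
    hang candidate. (Illustrative sketch, not the paper's algorithm.)"""
    pending = {}  # (rank, op_id) -> enter timestamp
    for rank, op_id, phase, t in events:
        if phase == "enter":
            pending[(rank, op_id)] = t
        elif phase == "exit":
            pending.pop((rank, op_id), None)
    return sorted({rank for (rank, _), t in pending.items()
                   if now - t > timeout_s})

events = [
    (0, "allreduce#7", "enter", 100.0), (0, "allreduce#7", "exit", 101.0),
    (1, "allreduce#7", "enter", 100.0),  # never exits: candidate hang
]
print(find_hung_ranks(events, now=500.0))  # -> [1]
```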

Why It Matters

Reduces critical debugging time in large-scale AI training: cutting diagnosis from hours or days to roughly six minutes trims the associated GPU cluster downtime by up to 99%.