Developer Tools

trunk/30e30079a22806b8689c2ea5ba5a06d0c04ae448: [ROCm][CI] Remove distributed shards from trunk.yml (#177085)

An AMD engineer's config change trims ROCm distributed test shards in trunk.yml, addressing a 7.3-hour test-to-signal bottleneck.

Deep Dive

An AMD engineer has implemented a strategic optimization to PyTorch's continuous integration (CI) pipeline that addresses a major performance bottleneck. Jithunnair-amd's pull request #177085 modifies the `trunk.yml` configuration file for PyTorch's ROCm (AMD's GPU computing platform) testing infrastructure. The core change replaces 4-GPU test runners with 2-GPU runners for the distributed test shards, effectively halving each shard's GPU footprint on the CI system. The fix responds to alarming metrics: time-to-signal (TTS), the time developers wait for CI results after a change lands, had ballooned from 3.4 hours to 7.3 hours at the 90th percentile, while test duration itself grew from 3.6 to 4.2 hours.
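As a rough sketch of what such a runner swap looks like in a GitHub Actions test matrix (the job name, shard counts, and runner labels below are illustrative assumptions, not the PR's actual diff):

```yaml
# Illustrative sketch only; the real job names, shard counts, and
# runner labels in PyTorch's .github/workflows/trunk.yml may differ.
jobs:
  rocm-distributed-test:
    with:
      test-matrix: |
        { include: [
          # Before the change, shards like these were pinned to
          # 4-GPU ROCm runners, e.g. runner: "linux.rocm.gpu.4".
          # After the change, the same shards request 2-GPU runners,
          # halving the GPU footprint per shard:
          { config: "distributed", shard: 1, num_shards: 2, runner: "linux.rocm.gpu.2" },
          { config: "distributed", shard: 2, num_shards: 2, runner: "linux.rocm.gpu.2" },
        ]}
```

Because CI queue pressure is driven by how many GPUs each queued job reserves, shrinking the per-shard reservation relieves contention even if the total number of shards stays similar.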

The engineering rationale behind this change reveals careful resource optimization. Analysis showed that of PyTorch's 4,495 distributed unit tests, only 324 (approximately 7%) actually require more than 2 GPUs to execute. By moving the trunk shards to 2-GPU runners, the team defers only that 7% of tests out of per-commit CI while halving GPU resource contention per shard. The deferred >2-GPU tests still run through a separate periodic configuration (`periodic-rocm-mi355.yml`) that fires every 3 hours, so full coverage is maintained over time. This approach demonstrates how understanding test requirements at a granular level can yield significant infrastructure efficiency gains without compromising software quality.
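The periodic safety net for the >2-GPU tests is a standard GitHub Actions cron trigger. A minimal sketch, assuming the 3-hour cadence described above (the actual contents of `periodic-rocm-mi355.yml` may differ in detail):

```yaml
# Illustrative sketch; not the actual periodic-rocm-mi355.yml.
name: periodic-rocm-mi355
on:
  schedule:
    # Run at minute 0 of every 3rd hour, picking up the >2-GPU
    # distributed tests that no longer run per-commit on trunk.
    - cron: "0 */3 * * *"
```

The trade-off is latency, not coverage: a regression in a >2-GPU test surfaces within a few hours via the periodic run instead of on the offending commit.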

For the PyTorch development community, this optimization means faster feedback loops and reduced CI queue times. The PyTorch repository, with 98.2k stars and 27.2k forks, handles massive development traffic, making CI performance critical for maintainer productivity. By addressing the TTS regression through intelligent configuration rather than simply adding more hardware, the team demonstrates cost-effective scaling of open-source infrastructure. The change has already been approved and merged, providing immediate relief to developers working on AMD GPU support within one of the world's most popular deep learning frameworks.

Key Points
  • Switches ROCm CI from 4-GPU to 2-GPU runners, cutting distributed test resource pressure by 50%
  • Addresses a critical TTS regression in which p90 times more than doubled, from 3.4 to 7.3 hours
  • Keeps 93% of distributed tests on trunk, since only 324 of 4,495 tests (7%) require >2 GPUs; those run periodically instead

Why It Matters

Faster CI cycles mean quicker developer feedback and accelerated PyTorch development, especially for AMD GPU support.