Developer Tools

PyTorch unifies CPU wheel builds, slashing runner-minutes by 60%

A single job per architecture now builds CPU wheels for every Python version, saving 140 (aarch64) to 183 (x86) runner-minutes per nightly.

Deep Dive

PyTorch has merged a significant CI optimization (PR #183931) that consolidates the CPU wheel builds for the aarch64 and x86 architectures, so that each architecture builds every Python version on a single runner. Previously, each Python version (3.9–3.13) triggered a separate build job, consuming up to 294 runner-minutes for x86 and 244 for aarch64 per nightly. The unified approach loops over all desired Python versions within one job, reusing the `build/` directory across iterations and recompiling only the Python-specific parts such as `libtorch_python` and `_C`. The key mechanism is a template parameter, `unified_arch_types = ['cpu', 'cpu-aarch64']`, that acts as the partition switch: architecture types listed there take the unified path, and future additions (e.g., CUDA variants) can follow the same pattern.
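To make the partition switch concrete, here is a minimal sketch of how a job planner could split a build matrix on such a list. Only the `unified_arch_types` value comes from the change itself; the `BuildJob` type, the `plan_build_jobs` function, and the `cuda` label are hypothetical names used for illustration, not PyTorch's actual workflow generator.

```python
from dataclasses import dataclass

# Value quoted from the change; everything else in this sketch is hypothetical.
UNIFIED_ARCH_TYPES = ["cpu", "cpu-aarch64"]


@dataclass(frozen=True)
class BuildJob:
    arch_type: str
    python_versions: tuple  # Python versions built inside this one job, in order


def plan_build_jobs(arch_types, python_versions, unified=UNIFIED_ARCH_TYPES):
    """One job per arch for unified arch types, else one job per (arch, Python)."""
    jobs = []
    for arch in arch_types:
        if arch in unified:
            # Unified path: a single job loops over every Python version.
            jobs.append(BuildJob(arch, tuple(python_versions)))
        else:
            # Legacy path: the per-Python matrix, one job per version.
            jobs.extend(BuildJob(arch, (py,)) for py in python_versions)
    return jobs


jobs = plan_build_jobs(
    ["cpu", "cpu-aarch64", "cuda"],
    ["3.9", "3.10", "3.11", "3.12", "3.13"],
)
for job in jobs:
    print(job.arch_type, job.python_versions)
```

In this sketch the two CPU entries each collapse to one job covering five Python versions, while the hypothetical `cuda` entry still expands to five separate jobs. Moving another architecture type to the unified path would then amount to adding it to the list.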

Results are striking: aarch64 runner-minutes dropped from 244 to 104 (57% reduction), and x86 from 294 to 111 (62% reduction). Wall-clock build time increased by ~2.2–2.5x, but end-to-end time remains unchanged (~2h37m) because test queueing was the bottleneck under the old layout. The change was authored with Claude, reflecting growing use of AI-assisted development. CUDA, ROCm, and XPU builds still use the per-Python matrix; test and upload phases also remain per-Python to isolate failures. This optimization primarily cuts compute costs on expensive runners (e.g., `linux.arm64.r7g.12xlarge.memory`) used in PyTorch's nightly CI.
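The percentages above follow directly from the reported runner-minutes. The short check below recomputes them; the numbers are the ones quoted in this summary, while the `reduction` helper and the variable names are illustrative.

```python
# Runner-minutes per nightly as reported above: (before, after).
RUNNER_MINUTES = {"aarch64": (244, 104), "x86": (294, 111)}


def reduction(before, after):
    """Return (minutes saved, percent reduction rounded to the nearest integer)."""
    saved = before - after
    return saved, round(100 * saved / before)


for arch, (before, after) in RUNNER_MINUTES.items():
    saved, pct = reduction(before, after)
    print(f"{arch}: {before} -> {after} saves {saved} runner-minutes ({pct}%)")

# Combined CPU total across both architectures.
total_before = sum(before for before, _ in RUNNER_MINUTES.values())
total_after = sum(after for _, after in RUNNER_MINUTES.values())
saved, pct = reduction(total_before, total_after)
print(f"combined: {total_before} -> {total_after} saves {saved} runner-minutes ({pct}%)")
```

With these figures the combined total falls from 538 to 215 runner-minutes, a reduction of about 60%, which is consistent with the headline number.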

Key Points
  • Runner-minutes reduced by 57% for aarch64 (244 → 104) and 62% for x86 (294 → 111) per nightly.
  • Unified build reuses non-Python artifacts across CPython versions within a single job.
  • End-to-end time unchanged at ~2h37m; wall-clock build time rose ~2.2–2.5x, but test queueing, not the build, was the bottleneck under the old layout.

Why It Matters

Substantial CI cost savings for PyTorch's nightly builds: roughly 60% fewer runner-minutes for CPU wheel builds on expensive runners, with no change in end-to-end time.