PyTorch unifies CPU wheel builds, slashing runner-minutes by 60%

PyTorch Releases May 16, 2026

⚡ One job per architecture now builds CPU wheels for every Python version, saving 140 (aarch64) to 183 (x86) runner-minutes per nightly.

Deep Dive

PyTorch has merged a CI optimization (PR #183931) that consolidates CPU wheel builds for the aarch64 and x86 architectures so that each architecture is built in a single job. Previously, each Python version (3.9–3.13) triggered a separate build job, consuming up to 294 runner-minutes per nightly for x86 and 244 for aarch64. The unified approach loops over all desired Python versions within one job, reusing the `build/` directory across iterations and recompiling only the Python-specific parts such as `libtorch_python` and `_C`. A template parameter, `unified_arch_types = ['cpu', 'cpu-aarch64']`, acts as the partition switch between the unified and per-Python layouts, so future additions (e.g., CUDA variants) can follow the same pattern.
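The partition described above can be illustrated with a small sketch. This is not the PR's actual template code: only the `unified_arch_types` value comes from the article, while `plan_build_jobs`, the `"cuda"` label, and the explicit version list (an expansion of the 3.9–3.13 range) are hypothetical names chosen for illustration.

```python
# Hypothetical sketch of the partition switch; not PyTorch's actual CI template.
PYTHON_VERSIONS = ["3.9", "3.10", "3.11", "3.12", "3.13"]  # assumed expansion of 3.9-3.13
UNIFIED_ARCH_TYPES = ["cpu", "cpu-aarch64"]  # value quoted in the article


def plan_build_jobs(arch_types, python_versions=PYTHON_VERSIONS,
                    unified_arch_types=UNIFIED_ARCH_TYPES):
    """Return one dict per build job, listing the Python versions it covers."""
    jobs = []
    for arch in arch_types:
        if arch in unified_arch_types:
            # Unified layout: a single job loops over every Python version.
            jobs.append({"arch": arch, "pythons": list(python_versions)})
        else:
            # Per-Python matrix: one job for each (arch, version) pair.
            jobs.extend({"arch": arch, "pythons": [v]} for v in python_versions)
    return jobs


jobs = plan_build_jobs(["cpu", "cpu-aarch64", "cuda"])
for job in jobs:
    print(job["arch"], job["pythons"])
```

Under these assumptions the two CPU architectures each produce one build job covering five Python versions, while the hypothetical `"cuda"` entry still fans out into five per-Python jobs. Moving an architecture to the unified layout would then amount to adding its name to the list.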

Results are striking: aarch64 runner-minutes dropped from 244 to 104 (57% reduction), and x86 from 294 to 111 (62% reduction). Wall-clock build time increased by ~2.2–2.5x, but end-to-end time remains unchanged (~2h37m) because test queueing was the bottleneck under the old layout. The change was authored with Claude, reflecting growing use of AI-assisted development. CUDA, ROCm, and XPU builds still use the per-Python matrix; test and upload phases also remain per-Python to isolate failures. This optimization primarily cuts compute costs on expensive runners (e.g., `linux.arm64.r7g.12xlarge.memory`) used in PyTorch's nightly CI.
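The percentages follow directly from the before and after runner-minute totals quoted above. A minimal check, where the helper name `runner_minute_savings` is purely illustrative:

```python
def runner_minute_savings(before, after):
    """Return (minutes saved, percent reduction rounded to the nearest integer)."""
    saved = before - after
    return saved, round(100 * saved / before)


# Per-nightly runner-minute totals quoted in the article.
for arch, before, after in [("aarch64", 244, 104), ("x86", 294, 111)]:
    saved, pct = runner_minute_savings(before, after)
    print(f"{arch}: {saved} runner-minutes saved ({pct}% reduction)")
```

This reproduces the 140 and 183 runner-minute savings and the 57% and 62% reductions, which the headline rounds to 60%.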

Key Points

Runner-minutes reduced by 57% for aarch64 (244 → 104) and 62% for x86 (294 → 111) per nightly.
Unified build reuses non-Python artifacts across CPython versions within a single job.
End-to-end time unchanged at ~2h37m; wall-clock build increased but test queueing eliminated as bottleneck.

Why It Matters

The change cuts runner-minutes for PyTorch's nightly CPU wheel builds by roughly 60%, lowering compute costs on expensive runners without lengthening end-to-end time.
