Developer Tools

trunk/591f9e7dbed33e3a1f985983bcc623607bce874e: [overlap] using copyengine collectives (#179937)

New optimization reduces GPU contention, delivering faster AI model training without hardware changes.

Deep Dive

Meta's PyTorch team has merged a significant performance optimization (PR #179937) that speeds up AI model training on NVIDIA H100 GPUs. The update introduces "CopyEngine collectives," which replace standard NCCL collectives when GPUs are connected via NVLink. This addresses a critical bottleneck: NCCL operations run kernels on the GPU's Streaming Multiprocessors (SMs), creating contention that slows overlapping matrix multiplications (matmuls) by 30-40%. The new method offloads collective communication to the GPU's dedicated copy engines (DMA units), freeing the SMs for computation.
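
To make the contention concrete, here is a minimal sketch of the usual overlap pattern using standard PyTorch distributed APIs. It illustrates the problem the PR targets, not the PR's implementation; the function and variable names are ours. Even when the collective is issued on a side stream, NCCL launches an SM-resident kernel, so it steals compute from the matmul it is meant to overlap.

```python
import torch
import torch.distributed as dist

# Assumes torch.distributed has been initialized with the NCCL backend,
# e.g. dist.init_process_group("nccl"), and tensors live on the local GPU.

def overlapped_step(weight, activations, shard):
    # Flat output buffer sized for the gathered shards from all ranks.
    gathered = torch.empty(
        shard.numel() * dist.get_world_size(),
        dtype=shard.dtype, device=shard.device,
    )

    # Issue the all-gather on a side stream so it can overlap the matmul.
    # With NCCL this still launches an SM-resident kernel, which contends
    # with the matmul below (the 30-40% slowdown cited in the PR). A
    # copy-engine collective would drive the DMA engines for the transfer
    # instead, leaving the SMs free for the matmul.
    comm_stream = torch.cuda.Stream()
    with torch.cuda.stream(comm_stream):
        work = dist.all_gather_into_tensor(gathered, shard, async_op=True)

    out = activations @ weight  # compute meant to overlap the communication
    work.wait()                 # join before consuming the gathered tensor
    return out, gathered
```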

Benchmarks on an 8×H100 NVLink cluster training a Llama 3 8B model show clear gains. With a 4K sequence length and Tensor Parallelism (TP) = 1, throughput increased by 2.4%, from 9,369 to 9,593 tokens per second, and Model FLOPs Utilization (MFU) rose from 48.76% to 49.93%. At an 8K sequence length using auto-bucketing, the improvement was 2.5%. The optimization is most effective in Fully Sharded Data Parallel (FSDP) setups without TP; when TP is active, NVLink contention can currently negate the benefit, though future scheduling changes may resolve this.
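
As a quick sanity check, the two headline figures are mutually consistent: throughput and MFU scale together, so both work out to roughly the same relative gain.

```python
# Sanity check on the numbers reported in the PR (4K seq, TP=1).
baseline_tps, optimized_tps = 9369, 9593    # tokens per second
baseline_mfu, optimized_mfu = 48.76, 49.93  # percent

tps_gain = optimized_tps / baseline_tps - 1
mfu_gain = optimized_mfu / baseline_mfu - 1
print(f"throughput: +{tps_gain:.1%}, MFU: +{mfu_gain:.1%}")  # both ~ +2.4%
```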

The implementation works as a separate pass compatible with existing bucketing strategies like `transformer_block_bucketing`. It automatically decides when to switch to CopyEngine collectives based on heuristics that check for NVLink connectivity and compute overlap, as sketched below. This is a pure software win, extracting more performance from current data center hardware and directly accelerating the development cycle for large language models.
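
The PR's description names only those two checks, but the shape of the decision can be sketched as follows. The type and field names here are illustrative assumptions, not PyTorch's actual data model.

```python
from dataclasses import dataclass

@dataclass
class CollectiveSite:
    # Hypothetical description of one collective in the schedule; these
    # fields stand in for whatever the real pass inspects.
    nvlink_connected: bool     # do the participating GPUs share NVLink?
    overlapping_compute: bool  # is a matmul scheduled against this collective?

def choose_backend(site: CollectiveSite) -> str:
    # Copy engines only pay off when the transfer can ride NVLink and there
    # is overlapping compute whose SMs are worth shielding; otherwise the
    # standard NCCL path stays in place.
    if site.nvlink_connected and site.overlapping_compute:
        return "copy_engine"
    return "nccl"

# Example: an FSDP all-gather overlapped with a matmul on NVLink peers.
print(choose_backend(CollectiveSite(True, True)))   # -> copy_engine
print(choose_backend(CollectiveSite(False, True)))  # -> nccl
```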

Key Points
  • Boosts Llama 3 8B training throughput by 2.4% (9,593 vs. 9,369 TPS) on 8×H100 NVLink systems
  • Avoids the 30-40% matmul slowdown caused by NCCL collectives contending for GPU SMs
  • Software-only optimization requires no hardware changes, integrating as a pass in PyTorch's bucketing schedulers

Why It Matters

This directly reduces training time and cost for AI labs, making existing GPU clusters more efficient for developing models like Llama 3.