trunk/764350935dcc573221dd3fa8e29fd16abc641a62: [Inductor] fix performance regression caused by #173662 (#176772)
A recent PyTorch change inadvertently inflated the instruction count of GPU matrix multiplication by billions of instructions per benchmark run.
The PyTorch development team has resolved a significant performance regression in the Inductor compiler, a critical component for accelerating PyTorch models on GPUs. The bug was introduced in a previous pull request (#173662) and was fixed in commit 764350935dcc573221dd3fa8e29fd16abc641a62 (PR #176772). Benchmarks revealed that the regression dramatically increased the instruction count of fundamental matrix multiplication (matmul) operations on GPU, a core workload for AI training and inference. For example, the `mm_loop_inductor_gpu` benchmark's instruction count spiked from a baseline of ~4.3 billion to ~58 billion under the regression, then settled at ~3.9 billion after the fix.
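The instruction-count comparison above can be expressed as a percent change. The helper below is a hypothetical illustration (it is not part of PyTorch or its benchmark suite), using the figures reported for `mm_loop_inductor_gpu`:

```python
def instruction_delta_pct(before: int, after: int) -> float:
    """Percent change in instruction count from `before` to `after`.

    Hypothetical helper for reading pr_time-style benchmark numbers;
    not part of the actual PyTorch benchmark tooling.
    """
    return (after - before) / before * 100.0

BASELINE = 4_300_000_000   # ~4.3B: pre-regression instruction count
SPIKE = 58_000_000_000     # ~58B: instruction count under the regression
FIXED = 3_900_000_000      # ~3.9B: instruction count after the fix

print(f"regression vs baseline: {instruction_delta_pct(BASELINE, SPIKE):+.0f}%")
print(f"fix vs baseline:        {instruction_delta_pct(BASELINE, FIXED):+.1f}%")
```

Framed this way, the regression was more than a tenfold increase over baseline, while the fix lands slightly below the original baseline.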
The fix specifically targeted Inductor's code generation without affecting other advanced features such as user Triton kernel fusion, as confirmed by the passing `TestUserKernelEpilogueFusion` test suite. Performance was validated with the `benchmarks/dynamo/pr_time_benchmarks/benchmarks/mm_loop.py` script, comparing instruction counts before and after the patch. This correction is crucial for the efficiency of PyTorch's just-in-time (JIT) compilation pipeline, ensuring that models converted via `torch.compile` run at peak speed. For developers and researchers, this means faster iteration times and lower computational costs when training and deploying models, such as LLMs and diffusion models, that rely heavily on optimized matrix operations.
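The fix is transparent to `torch.compile` users. A minimal sketch of the matmul-loop pattern this benchmark family exercises is shown below; it is illustrative code, not the actual `mm_loop.py` script, and assumes a recent PyTorch build (the `mm_loop` name here is invented for the example):

```python
import torch

def mm_loop(a: torch.Tensor, b: torch.Tensor, iters: int = 4) -> torch.Tensor:
    # A chain of matrix multiplications: the kind of workload the
    # mm_loop-style benchmarks measure. Illustrative only.
    out = a
    for _ in range(iters):
        out = out @ b
    return out

device = "cuda" if torch.cuda.is_available() else "cpu"
a = torch.randn(32, 32, device=device)
b = torch.randn(32, 32, device=device)

# torch.compile routes through the Inductor backend by default,
# so this path exercises the code generation the fix targeted.
compiled_mm_loop = torch.compile(mm_loop)

eager = mm_loop(a, b)
compiled = compiled_mm_loop(a, b)
# Compiled and eager results should agree to floating-point tolerance.
assert torch.allclose(eager, compiled, rtol=1e-3, atol=1e-3)
```

The pr_time benchmarks count CPU instructions executed during compilation and execution of loops like this, which is how the ~58B spike and the corrected ~3.9B figure were measured.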
- Fixed a performance regression in PyTorch's Inductor compiler that inflated GPU matmul instruction counts more than tenfold.
- Bug was introduced in PR #173662 and corrected in commit 7643509 (PR #176772).
- Restores instruction counts for key benchmarks, e.g., from a ~58B spike back down to ~3.9B, below the ~4.3B pre-regression baseline.
Why It Matters
Ensures AI model training and inference in PyTorch remain fast and cost-efficient, critical for production workloads.