trunk/11d35a3ad40d50ae7cdae5635b33359a4c60f208: Add GEMM configs to XPU autotuning heuristic (#177647)
A new commit adds specialized GEMM configurations, improving performance for tall-skinny matrix shapes by up to 30%.
The PyTorch team has merged a significant performance optimization for AI workloads running on Intel XPU GPUs. The commit (11d35a3ad40d50ae7cdae5635b33359a4c60f208) adds two new matrix multiplication (GEMM) configurations to the XPU autotuning heuristic. The change addresses inefficiencies when processing 'tall-skinny' matrix shapes, common in certain AI model layers, where one dimension is vastly larger than the other. By adding configs that set BLOCK_N=64 to match the N dimension exactly, a single block tile covers the full N dimension with no padded tiles, reducing the number of workgroups required and improving GPU occupancy and computational efficiency.
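The workgroup-count argument can be made concrete with a little arithmetic. This is an illustrative sketch, not code from the commit: for a blocked GEMM, the launch grid typically has one workgroup per (BLOCK_M, BLOCK_N) output tile, so the grid size is ceil(M/BLOCK_M) * ceil(N/BLOCK_N).

```python
import math

# Illustrative only: workgroup count for a blocked GEMM launch grid,
# assuming one workgroup per (BLOCK_M x BLOCK_N) output tile.
def gemm_workgroups(M: int, N: int, block_m: int, block_n: int) -> int:
    return math.ceil(M / block_m) * math.ceil(N / block_n)

M, N = 10000, 64  # tall-skinny shape from the issue

# A smaller BLOCK_N splits the N dimension into multiple tiles:
small_n = gemm_workgroups(M, N, block_m=64, block_n=32)   # 157 * 2 = 314

# BLOCK_N=64 matches N exactly: one tile spans the whole N dimension,
# halving the workgroup count with no padding waste.
matched = gemm_workgroups(M, N, block_m=64, block_n=64)   # 157 * 1 = 157
```

With BLOCK_N=64 the grid collapses to a single column of tiles, so the same output is produced by half as many workgroups, each doing full (unpadded) work.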
This optimization was benchmarked on Intel's BMG GPU and directly resolves GitHub issue #6012 in the intel-xpu-backend-for-triton repository. The fix is a targeted enhancement for PyTorch's Triton compiler backend, which is crucial for generating high-performance GPU code. For developers and researchers using Intel hardware for AI training or inference, this update means specific operations will execute faster and more efficiently, improving overall throughput for models that utilize these particular tensor shapes. It represents a continued effort to refine PyTorch's performance across diverse hardware platforms.
- Targets 'tall-skinny' GEMM shapes (e.g., M=10000, N=64, K=64) common in AI models.
- Adds XPU-specific configs that set BLOCK_N=64 to match the N dimension exactly, improving GPU occupancy.
- Directly fixes a reported performance issue (#6012) in the Intel XPU backend for Triton.
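The exact configurations are not reproduced here, but the selection idea can be sketched in plain Python. Everything below is hypothetical: the `CANDIDATE_CONFIGS` layout and the `pick_configs` helper are illustrative stand-ins, not the actual PyTorch/Inductor heuristic code.

```python
# Hypothetical sketch of an autotuning candidate list; dict keys mirror
# common Triton block-size parameters but the values are illustrative.
CANDIDATE_CONFIGS = [
    {"BLOCK_M": 64,  "BLOCK_N": 128, "BLOCK_K": 32, "num_warps": 8},
    {"BLOCK_M": 128, "BLOCK_N": 128, "BLOCK_K": 32, "num_warps": 8},
    # Tall-skinny additions: BLOCK_N matches a small N dimension exactly.
    {"BLOCK_M": 64,  "BLOCK_N": 64,  "BLOCK_K": 32, "num_warps": 4},
    {"BLOCK_M": 32,  "BLOCK_N": 64,  "BLOCK_K": 64, "num_warps": 4},
]

def pick_configs(M: int, N: int, K: int) -> list[dict]:
    """Prefer configs whose BLOCK_N exactly matches a small N dimension."""
    if N <= 64:
        exact = [c for c in CANDIDATE_CONFIGS if c["BLOCK_N"] == N]
        if exact:
            return exact
    return CANDIDATE_CONFIGS

# For the tall-skinny shape M=10000, N=64, K=64 only the BLOCK_N=64
# candidates survive the filter and get benchmarked by the autotuner.
tall_skinny = pick_configs(10000, 64, 64)
```

In the real autotuner, each surviving candidate is compiled and timed, and the fastest wins; adding shape-matched candidates simply ensures the right configs are in the race for these shapes.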
Why It Matters
Boosts performance for AI training/inference on Intel GPUs, making PyTorch more competitive and efficient on alternative hardware.