trunk/9ea9b4e7c6086401194e8094b34d5121ff01d9f7: [xpu] Update torch-xpu-ops commit pin (#177721)
A PyTorch update enables up to 2x faster transformer training on Intel GPUs through new fused operations and sparse tensor support.
Intel and the PyTorch developers have merged a significant update to PyTorch's XPU (Intel GPU) support infrastructure. The commit (9ea9b4e) updates the torch-xpu-ops dependency to pull in performance optimizations for AI workloads on Intel hardware. Most notably, it enables fused RMS norm forward and backward operations directly on XPU devices, which can accelerate transformer model training by up to 2x compared to the previous unfused implementation. The update also adds support for key linear algebra operations (addmm, mm, bmm, and baddbmm) on SparseCsrXPU tensors, expanding the range of models that can run efficiently on Intel GPUs.
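To make the fusion concrete: a fused RMS norm kernel computes, in a single pass over the data, what the following plain-Python sketch spells out step by step. This is only the reference math, not the kernel's actual signature; the function name and the `eps` default here are illustrative.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Reference RMS norm: scale x by the inverse of its root-mean-square,
    then apply a learned elementwise weight. A fused XPU kernel performs the
    reduction, normalization, and scaling in one pass instead of three."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * w for v, w in zip(x, weight)]

out = rms_norm([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
```

Fusing these steps avoids materializing the intermediate normalized tensor in memory, which is where most of the speedup on bandwidth-bound transformer layers comes from.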
The technical challenge solved by this update was preventing symbol collisions in the AOTI (Ahead-of-Time Inductor) compilation system. Previously, registering the fused RMS norm operations in torch-xpu-ops would create duplicate symbols that conflicted with PyTorch's existing code generation. The solution involved moving the registration of `_fused_rms_norm_xpu` and `_fused_rms_norm_backward_xpu` operations directly into PyTorch's core codebase rather than the extension library. This ensures proper symbol resolution while maintaining the performance benefits of hardware-accelerated operations.
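The collision can be pictured with a toy registry (this is an analogy, not PyTorch's actual dispatcher): when two libraries both try to register the same operator symbol, the second registration fails, so each op must live in exactly one place.

```python
class OpRegistry:
    """Toy stand-in for an operator registry that rejects duplicate symbols."""

    def __init__(self):
        self._ops = {}

    def register(self, name, fn):
        if name in self._ops:
            # Two definitions of the same symbol cannot coexist.
            raise RuntimeError(f"duplicate registration for {name}")
        self._ops[name] = fn

registry = OpRegistry()
# PyTorch core registers the op once...
registry.register("_fused_rms_norm_xpu", lambda x: x)
try:
    # ...so a second registration from the extension library collides.
    registry.register("_fused_rms_norm_xpu", lambda x: x)
    collided = False
except RuntimeError:
    collided = True
```

Moving the registration into core mirrors the first call above: there is a single authoritative owner of the symbol, and the extension links against it instead of redefining it.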
Developers added comprehensive unit tests to validate both the correct registration and numerical accuracy of these new operations. The update represents a significant step in maturing Intel's XPU ecosystem within the PyTorch framework, making it more competitive with NVIDIA's CUDA ecosystem for AI training and inference workloads. This is particularly important as Intel continues to expand its data center GPU offerings and seeks to provide viable alternatives in the AI hardware market.
- Enables fused RMS norm forward/backward operations on Intel XPU GPUs for up to 2x faster transformer training
- Adds addmm, mm, bmm, baddbmm operations for SparseCsrXPU tensors, expanding supported model architectures
- Fixes symbol collision issues by registering operations in PyTorch core instead of extension library
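The sparse addmm listed above computes `out = beta * input + alpha * (mat1 @ mat2)` where `mat1` is stored in CSR form (row pointers, column indices, values). This plain-Python sketch shows the math the SparseCsrXPU kernel implements; the helper name and argument layout are illustrative, not PyTorch's API.

```python
def csr_addmm(inp, crow, col, vals, mat2, alpha=1.0, beta=1.0):
    """Dense addmm with a CSR left operand: beta*inp + alpha*(csr @ mat2).
    crow[i]:crow[i+1] delimits the nonzeros of row i in (col, vals)."""
    rows, cols = len(inp), len(inp[0])
    out = [[beta * inp[i][j] for j in range(cols)] for i in range(rows)]
    for i in range(rows):
        for k in range(crow[i], crow[i + 1]):  # iterate nonzeros of row i only
            v, j = vals[k], col[k]
            for c in range(cols):
                out[i][c] += alpha * v * mat2[j][c]
    return out
```

Because the inner loop touches only stored nonzeros, the cost scales with the sparsity pattern rather than the full matrix size, which is what makes CSR support worthwhile for models with sparse weight or adjacency matrices.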
Why It Matters
Makes Intel GPUs more competitive for AI workloads, potentially lowering training costs and increasing hardware options for ML teams.