PyTorch 2.12 delivers 100x faster eigendecomposition and unified graph API

PyTorch Blog May 14, 2026

⚡ Batched eigendecomposition on CUDA is now up to 100x faster, plus a new device-agnostic graph capture API.

Deep Dive

PyTorch 2.12 delivers major performance gains, most notably batched eigenvalue decomposition (linalg.eigh) that is up to 100x faster on CUDA. The overhauled cuSolver backend replaces the legacy MAGMA backend and automatically uses syevj_batched for batched symmetric/Hermitian problems, turning multi-minute workloads into seconds. In addition, the Adagrad optimizer now supports fused=True, which performs the optimizer step in a single kernel and so reduces launch overhead and memory traffic. Adagrad thereby joins Adam, AdamW, and SGD in the fused family.
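To give a feel for what a batched symmetric eigensolver computes, here is a minimal pure-Python sketch of the cyclic Jacobi method, the algorithm family behind cuSOLVER's syevj routines. It is a teaching illustration under simplifying assumptions (real symmetric input, eigenvalues only, one matrix at a time in a plain loop), not the cuSOLVER or PyTorch implementation. The function names `jacobi_eigvalsh` and `batched_eigvalsh` are hypothetical.

```python
import math


def jacobi_eigvalsh(matrix, tol=1e-24, max_sweeps=50):
    """Eigenvalues (ascending) of one real symmetric matrix via cyclic Jacobi rotations."""
    n = len(matrix)
    a = [[float(x) for x in row] for row in matrix]  # work on a copy
    for _ in range(max_sweeps):
        # Converged once the squared off-diagonal mass is negligible.
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                # Rotation angle that zeroes the (p, q) entry.
                theta = 0.5 * math.atan2(2.0 * a[p][q], a[q][q] - a[p][p])
                c, s = math.cos(theta), math.sin(theta)
                # Apply A <- J^T A J: first update columns p and q ...
                for k in range(n):
                    akp, akq = a[k][p], a[k][q]
                    a[k][p] = c * akp - s * akq
                    a[k][q] = s * akp + c * akq
                # ... then rows p and q.
                for k in range(n):
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k] = c * apk - s * aqk
                    a[q][k] = s * apk + c * aqk
    return sorted(a[i][i] for i in range(n))


def batched_eigvalsh(batch):
    """Solve every matrix in the batch independently (a GPU would do these in parallel)."""
    return [jacobi_eigvalsh(m) for m in batch]


batch = [
    [[2.0, 1.0], [1.0, 2.0]],
    [[2.0, 0.0, 0.0], [0.0, 3.0, 4.0], [0.0, 4.0, 9.0]],
]
results = batched_eigvalsh(batch)
```

In PyTorch itself none of this is written by hand: torch.linalg.eigh accepts a tensor of shape (..., n, n) and returns eigenvalues and eigenvectors for every matrix in the batch, with the backend choice handled internally.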

The new torch.accelerator.Graph API provides a device-agnostic interface for graph capture and replay, unifying CUDA, XPU, and out-of-tree backends with a consistent user-facing API. Each backend registers its own implementation via a lightweight GraphImplInterface. torch.export.save also adds support for Microscaling (MX) quantization formats, allowing full export of aggressively compressed models. These features, along with 2,926 commits from 457 contributors, reinforce PyTorch's trajectory from a research-first framework to a unified, hardware-agnostic platform for production AI.
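The registration pattern described above can be illustrated with a toy registry. This sketch is not the torch.accelerator.Graph API and does not reproduce GraphImplInterface; every name in it (`ToyGraphImpl`, `register_graph_impl`, `ToyGraph`) is hypothetical and exists only to show how one user-facing class can dispatch to per-backend implementations.

```python
# Hypothetical registry mapping a device name to a factory for its graph implementation.
_GRAPH_IMPLS = {}


class ToyGraphImpl:
    """Hypothetical backend implementation: records callables, then replays them in order."""

    def __init__(self, device):
        self.device = device
        self.ops = []

    def capture(self, op):
        self.ops.append(op)

    def replay(self):
        return [op() for op in self.ops]


def register_graph_impl(device, factory):
    """Each backend registers its own factory once, for example at import time."""
    _GRAPH_IMPLS[device] = factory


class ToyGraph:
    """Device-agnostic front end: same methods regardless of the backend behind it."""

    def __init__(self, device):
        if device not in _GRAPH_IMPLS:
            raise ValueError(f"no graph implementation registered for {device!r}")
        self._impl = _GRAPH_IMPLS[device](device)

    def capture(self, op):
        self._impl.capture(op)

    def replay(self):
        return self._impl.replay()


# Two backends share the toy implementation here; real backends would differ.
register_graph_impl("cuda", ToyGraphImpl)
register_graph_impl("xpu", ToyGraphImpl)

graph = ToyGraph("xpu")
graph.capture(lambda: 1 + 1)
graph.capture(lambda: "replayed")
replayed = graph.replay()
```

The point of the design is that user code only touches the front-end class, so an out-of-tree backend can add support by registering an implementation without changing any caller.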

Key Points

Batched linalg.eigh on CUDA is up to 100x faster thanks to updated cuSolver backend selection.
New torch.accelerator.Graph API provides device-agnostic graph capture and replay across CUDA, XPU, and more.
Adagrad optimizer now supports fused=True, joining Adam/AdamW/SGD for single-kernel optimizer steps.
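For reference, the arithmetic of one basic Adagrad step is sketched below in plain Python, assuming scalar parameters and no weight decay or learning-rate decay. A fused implementation performs this same update for all parameters in a single kernel launch rather than as a sequence of separate operations. The helper name `adagrad_step` is hypothetical.

```python
import math


def adagrad_step(params, grads, state_sums, lr=0.01, eps=1e-10):
    """Apply one basic Adagrad update in place to every parameter."""
    for i, grad in enumerate(grads):
        # Accumulate the squared gradient for this parameter.
        state_sums[i] += grad * grad
        # Scale the step by the inverse root of the accumulated sum.
        params[i] -= lr * grad / (math.sqrt(state_sums[i]) + eps)


params = [1.0, -2.0]
grads = [0.5, -0.5]
state_sums = [0.0, 0.0]
adagrad_step(params, grads, state_sums)
```

In PyTorch, per the release summarized here, the fused path is requested by passing fused=True when constructing the Adagrad optimizer.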

Why It Matters

PyTorch 2.12 speeds up scientific-computing workloads that depend on batched eigendecomposition and simplifies deploying models across multiple hardware backends.
