trunk/916b711d81211d78e93d88e3b1773b5b19fc0373: Update eigh CUDA heuristics (#175403)
A single heuristic change in PyTorch delivers over 100x speedup for critical linear algebra operations on GPUs.
The PyTorch team has merged a critical performance fix (commit 916b711) that dramatically accelerates a fundamental linear algebra operation on NVIDIA GPUs. The update changes the backend selection heuristic for `torch.linalg.eigh`, the function for computing eigenvalues and eigenvectors of complex Hermitian or real symmetric matrices. Previously, this function was found to be around 100x slower than NVIDIA's CuPy library for batched inputs, as reported in GitHub issues #174674 and #174601. The suboptimal heuristics, introduced in pull request #53040, did not take advantage of recent improvements in the cuSOLVER library.
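To ground the discussion, here is a minimal sketch of the batched workload `torch.linalg.eigh` handles; the matrix sizes and batch shape are illustrative, not taken from the issue reports.

```python
import torch

# Build a batch of random real symmetric matrices -- one of the two input
# classes torch.linalg.eigh accepts (the other being complex Hermitian).
torch.manual_seed(0)
a = torch.randn(4, 8, 8, dtype=torch.float64)
sym = (a + a.mT) / 2  # symmetrize each matrix in the batch

# eigh returns eigenvalues in ascending order plus orthonormal eigenvectors,
# computed independently for every matrix in the batch.
w, v = torch.linalg.eigh(sym)

# Sanity check the decomposition: sym ~= V diag(w) V^T per batch element.
recon = v @ torch.diag_embed(w) @ v.mT
print(torch.allclose(recon, sym, atol=1e-10))  # True
```

On CUDA tensors, the same call is where the backend heuristic discussed below comes into play.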
Benchmarking revealed that the `linalg_eigh_cusolver_syevj_batched` backend is the fastest for nearly all matrix sizes; in the few cases where another backend such as `syevd` wins, the margin is as small as 0.05ms. The solution was to dispatch to this backend unconditionally. This change renders the non-batched `syevj` backend obsolete, leading to its removal from `CUDASolver.cpp` and `CUDASolver.h`. The fix was validated with `test/test_linalg.py`; the few observed failures were unrelated to the change and also present in the current nightly build.
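The point of a batched backend is that the whole stack of matrices goes through one dispatch rather than one kernel launch per matrix. This hypothetical sketch contrasts the two call patterns; it does not reproduce the commit's benchmark, and on CPU both paths use the same LAPACK routine.

```python
import torch

# A batch of small symmetric matrices, the regime where the batched
# cuSOLVER path (syevj_batched) pays off most on GPU.
torch.manual_seed(0)
a = torch.randn(64, 16, 16, dtype=torch.float64)
h = (a + a.mT) / 2

# One batched call: a single dispatch, where backend heuristics apply.
w_batched = torch.linalg.eigh(h).eigenvalues

# Equivalent per-matrix loop: 64 separate dispatches.
w_loop = torch.stack([torch.linalg.eigh(m).eigenvalues for m in h])

# Both paths agree on the eigenvalues.
print(torch.allclose(w_batched, w_loop))  # True
```

On a CUDA device (`h.cuda()`), the batched call is the one this commit routes to `linalg_eigh_cusolver_syevj_batched`.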
The performance impact is substantial. The code from the original issue now runs over 100x faster than on the previous PyTorch nightly build and outperforms CuPy by approximately 8x. This optimization matters for workloads in machine learning, quantum chemistry, and physics simulations that rely heavily on batched eigenvalue decompositions. The fix, approved by core maintainers @nikitaved and @lezcano, directly addresses performance regression issue #175585 and will benefit all users performing large-scale linear algebra computations on CUDA hardware.
- Commit 916b711 updates PyTorch's CUDA heuristics for `torch.linalg.eigh`, fixing a 100x slowdown versus CuPy.
- The fix unconditionally uses the `linalg_eigh_cusolver_syevj_batched` backend, making operations over 100x faster than the old nightly and 8x faster than CuPy.
- The obsolete `syevj` backend was removed from the codebase, resolving performance issues #174674, #174601, and #175585.
Why It Matters
Massively accelerates machine learning and scientific computing pipelines that depend on batched eigenvalue calculations on GPUs.