viable/strict/1776928667: Split onehot checks for CPU and accelerators (#179831)
A simple PR fix boosts XPU performance by avoiding costly CPU data transfers.
PyTorch has merged a pull request (#179831) that optimizes the OneHot operation by splitting boundary checks between CPU and accelerators. The change, tagged as viable/strict/1776928667, addresses an issue raised in Intel's torch-xpu-ops repository (issue #3284). Previously, OneHot performed boundary validation on every device except CUDA, MPS, XLA, and PrivateUse1. Intel's XPU accelerator was not on that exclusion list, so validation triggered costly device-to-host (D2H) memory transfers that significantly degraded performance.
The fix, approved by contributors guangyey and albanD, flips the conditional logic: boundary checks now run only on CPU, while all accelerators (including XPU, CUDA, MPS, XLA, and PrivateUse1) skip validation entirely. This is both simpler and more future-proof, since any new accelerator automatically benefits without needing to be added to an exclusion list. The change is a small but impactful optimization that improves inference and training performance on XPU hardware and sets a cleaner pattern for PyTorch's device-agnostic codebase.
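The flipped condition can be illustrated with a minimal sketch. This is not PyTorch's actual implementation (which lives in C++ dispatch code); the device-type strings and helper names below are illustrative assumptions:

```python
# Hedged sketch of the conditional flip described above. The helper names
# and device-type strings are illustrative, not PyTorch's real code.

def needs_boundary_check_old(device_type: str) -> bool:
    # Old logic: validate on every device NOT on a hard-coded exclusion list.
    # "xpu" was missing from the list, so it paid for a D2H transfer.
    excluded = {"cuda", "mps", "xla", "privateuse1"}
    return device_type not in excluded

def needs_boundary_check_new(device_type: str) -> bool:
    # New logic: only CPU validates; every accelerator (current or future)
    # skips the check automatically.
    return device_type == "cpu"

for dev in ("cpu", "cuda", "xpu"):
    print(dev, needs_boundary_check_old(dev), needs_boundary_check_new(dev))
```

Running the sketch shows the difference: under the old predicate `"xpu"` still required validation, while the new predicate restricts the check to `"cpu"` alone.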
- PyTorch PR #179831 splits OneHot boundary checks: CPU-only validation vs. accelerator skip.
- Intel XPU was missing from the accelerator exclusion list, causing costly D2H memory transfers.
- New logic flips the condition to only check CPU, automatically benefiting all accelerators including future ones.
Why It Matters
A small optimization that eliminates unnecessary device-to-host transfers, boosting accelerator performance for OneHot operations in PyTorch.