trunk/27b9a62f7763be367822cab074fc78723808d767: [MPS] Add ILP variant for binary tensor iterators (#182155)
A new ILP optimization speeds up MPS binary operations by up to 28% on Apple Silicon.
This PR extends the ILP_PER_THREAD dispatch strategy from unary to binary tensor operations on the MPS (Metal) backend, using instruction-level parallelism (ILP) to process four elements per thread so the compiler can emit wider memory loads and stores. The new `binary_dense_ilp` variant, added in `c10/metal/indexing.h`, is wired into `exec_binary_kernel` when all eligibility conditions hold: contiguous memory layout, floating-point dtype, no cast, broadcast, scalar, or alpha argument, and `numel >= 256K` (the same threshold as the unary path). The ILP loops are annotated with `#pragma unroll` to ensure the compiler emits efficient wide loads/stores, and the same pragma was retroactively added to the unary loops for consistency.
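Below is a minimal, illustrative Metal sketch of the ILP pattern, not the actual PyTorch kernel: the kernel name, buffer layout, and the hard-coded `float`/multiply op are assumptions for illustration, and tail handling for sizes not divisible by four is omitted.

```metal
#include <metal_stdlib>
using namespace metal;

// Illustrative only: each thread owns ILP_PER_THREAD consecutive elements,
// and #pragma unroll lets the compiler fuse the per-element accesses into
// wide (e.g. 128-bit) loads and stores.
constant uint ILP_PER_THREAD = 4;

kernel void mul_dense_ilp_sketch(
    device const float* a [[buffer(0)]],
    device const float* b [[buffer(1)]],
    device float* out     [[buffer(2)]],
    uint tid              [[thread_position_in_grid]]) {
  const uint base = tid * ILP_PER_THREAD;
  float va[ILP_PER_THREAD];
  float vb[ILP_PER_THREAD];
#pragma unroll
  for (uint i = 0; i < ILP_PER_THREAD; ++i) {
    // Packed loads: consecutive addresses coalesce into wide loads.
    va[i] = a[base + i];
    vb[i] = b[base + i];
  }
#pragma unroll
  for (uint i = 0; i < ILP_PER_THREAD; ++i) {
    // Compute, then packed stores back to consecutive addresses.
    out[base + i] = va[i] * vb[i];
  }
  // Note: tail handling (numel % ILP_PER_THREAD != 0) is omitted here.
}
```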
Benchmarked on M-series chips with shapes like `(1, 1024, 4096)`, the ILP variant shows 15–28% speedups for `mul` and `hypot` on fp16/bf16, 12–22% for `atan2`, and 1–2% on large fp32 tensors, which already run close to the memory-bandwidth limit. Small, compute-heavy fp32 shapes can regress, hence the 256K guard. The PR also adds a `PYTORCH_BINARY_FORCE_FLAVOR=ilp|scalar` environment variable for benchmarking, mirroring the unary counterpart; a sketch of the dispatch gate follows. Notably, the PR credits Claude with assisting in authoring the code.
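The following host-side C++ sketch shows how such a gate might look. The function name, parameters, and flavor enum are hypothetical; only the eligibility conditions, the 256K threshold, and the `PYTORCH_BINARY_FORCE_FLAVOR` override are taken from the PR description.

```c++
#include <cstdlib>
#include <cstring>

enum class Flavor { Scalar, Ilp };

// Hypothetical dispatch gate (the real logic lives in exec_binary_kernel):
// take the ILP flavor only for large, contiguous, floating-point inputs
// with no cast/broadcast/scalar/alpha, with an env override for benchmarks.
Flavor pick_binary_flavor(bool contiguous, bool is_floating,
                          bool needs_cast, bool broadcast,
                          bool has_scalar, bool has_alpha,
                          long numel) {
  // Env override, mirroring the unary path's force flag.
  if (const char* f = std::getenv("PYTORCH_BINARY_FORCE_FLAVOR")) {
    if (std::strcmp(f, "ilp") == 0) return Flavor::Ilp;
    if (std::strcmp(f, "scalar") == 0) return Flavor::Scalar;
  }
  constexpr long kIlpThreshold = 256 * 1024;  // 256K elements
  const bool eligible = contiguous && is_floating && !needs_cast &&
                        !broadcast && !has_scalar && !has_alpha &&
                        numel >= kIlpThreshold;
  return eligible ? Flavor::Ilp : Flavor::Scalar;
}
```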
- Adds `binary_dense_ilp` in `c10/metal/indexing.h` with `ILP_PER_THREAD = 4` and `#pragma unroll` for wide loads/stores.
- Eligible for contiguous, floating-point tensors with no cast/broadcast/scalar/alpha and `numel >= 256K`.
- Speedups of 15–28% for fp16/bf16 `mul`/`hypot` and 12–22% for `atan2` on M-series; 1–2% on large fp32 shapes.
Why It Matters
Speeds up core binary operations on Apple Silicon, making PyTorch more competitive for M-series machine learning workloads.