PyTorch #183055 fixes MPS binary tensor ops with precision and dispatch improvements
New Metal backend patch resolves float16 drift and broken cosine_similarity, and adds 10+ kernel fixes
PyTorch has merged a significant patch to its Metal Performance Shaders (MPS) backend, PR #183055 (authored with assistance from Claude), that resolves longstanding issues in binary tensor operations on Apple Silicon GPUs. The fix addresses two core problems. First, scalar promotion was incorrect: float16 scalars were rounded on the host, which caused CrossEntropyLoss drift. Second, mismatched kernel names led to the wrong kernel being selected (e.g., `mul_dense_scalar_float2_float` for complex convolution). A new `natural_output_dtype` parameter on `exec_binary_kernel` lets ops such as comparisons (which produce `kBool`) declare their output type explicitly, routing through a new `binary_strided_castout` kernel that computes at compile-time precision and casts on store. This contract is already used by the comparison-ops migration in #183019, fixing `cdist` mps_float32 (`linalg_vector_norm(p=0)` → `ne_outf`) and inductor's `test_complex_from_real_imag_mps`.
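The cast-on-store idea can be modeled in a few lines of plain Python. This is only an illustrative sketch, not the PR's Metal code: the `exec_binary` helper, the `CASTS` table, and the string dtype names are hypothetical stand-ins, and the model assumes the kernel evaluates the functor first and converts the result to the requested output type only when writing it.

```python
import operator

# Hypothetical dtype names mapped to Python conversions (illustration only).
CASTS = {"bool": bool, "float": float, "int": int}

def exec_binary(functor, a, b, natural_output_dtype, out_dtype=None):
    """Apply `functor` elementwise and cast each result when storing it.

    `natural_output_dtype` is what the op produces on its own (a comparison
    naturally yields bool). If the caller asks for a different `out_dtype`,
    the result is converted at store time, loosely modeling a cast-out kernel.
    """
    store = CASTS[out_dtype or natural_output_dtype]
    return [store(functor(x, y)) for x, y in zip(a, b)]

def p0_norm(xs):
    """Count nonzero entries: `ne` against zero, stored as float, then summed."""
    flags = exec_binary(operator.ne, xs, [0.0] * len(xs), "bool", out_dtype="float")
    return sum(flags)

xs = [0.0, 1.5, 0.0, -2.0]
natural = exec_binary(operator.ne, xs, [0.0] * len(xs), "bool")
as_float = exec_binary(operator.ne, xs, [0.0] * len(xs), "bool", out_dtype="float")
```

Here `natural` holds booleans while `as_float` holds `0.0`/`1.0` values that can be summed directly, which is roughly the shape of the `linalg_vector_norm(p=0)` case mentioned above.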
The patch also introduces a caller-tunable ILP (instruction-level parallelism) threshold, which defaults to 256K for float outputs and is off for non-float outputs. The unroll width is encoded in host-side kernel names (`add_dense_ilp4_float_float`) so that `ilp8`/`ilp16` variants can coexist in the future. Missing kernel combinations now fail loudly via `getPipelineStateForFunc` instead of silently corrupting results. A new `TestBinaryIteratorConformance` test suite compiles synthetic functors (`simple_add`, `simple_ge`) through the same MetalShaderLibrary path, and `test_binary_kernels` regressions cover scalar precision and in-place narrowing. The X-macro infrastructure (`C10_METAL_ALL_TYPES_FUNCTOR`) is extended to generate `val_at_offs` and `store_at_offs` automatically. The change is foundational: it unblocks further MPS op work and turns a class of silent errors into explicit failures.
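A rough Python model of the naming and lookup behavior is sketched below. It rests on several assumptions that the summary does not confirm: "256K" is read as 262,144 elements, ILP is assumed to apply at or above the threshold, the non-ILP name `add_dense_float_float` and the order of the two dtype suffixes are guesses, and `default_ilp_threshold`, `kernel_name`, and `get_pipeline_state` are hypothetical helpers rather than PyTorch APIs.

```python
FLOAT_DTYPES = {"float", "half", "bfloat"}  # assumed set of float output types

# Hypothetical registry standing in for the compiled Metal library.
AVAILABLE_KERNELS = {
    "add_dense_float_float",       # assumed non-ILP name
    "add_dense_ilp4_float_float",  # name pattern quoted in the summary
}

def default_ilp_threshold(out_dtype):
    """Return the assumed default: 256K elements for float outputs, off (None) otherwise."""
    return 256 * 1024 if out_dtype in FLOAT_DTYPES else None

def kernel_name(op, in_dtype, out_dtype, numel, ilp_threshold, ilp_width=4):
    """Build a host-side kernel name; the unroll width appears in the name."""
    use_ilp = ilp_threshold is not None and numel >= ilp_threshold
    layout = f"dense_ilp{ilp_width}" if use_ilp else "dense"
    return f"{op}_{layout}_{in_dtype}_{out_dtype}"

def get_pipeline_state(name):
    """Fail loudly when a kernel is missing instead of picking another one."""
    if name not in AVAILABLE_KERNELS:
        raise KeyError(f"no Metal kernel named {name!r}")
    return name

big = kernel_name("add", "float", "float", 1 << 20, default_ilp_threshold("float"))
small = kernel_name("add", "float", "float", 1024, default_ilp_threshold("float"))
missing = kernel_name("ge", "float", "bool", 1 << 20, default_ilp_threshold("bool"))
```

Because the width is part of the name, adding an `ilp8` variant in this model is just another registry entry, and a name that was never compiled raises an error at lookup time rather than producing wrong output.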
- Fixes scalar promotion: restricts it to lossless int→float conversions and stops rounding float16 scalars on the host (fixes CrossEntropyLoss drift)
- New `natural_output_dtype` contract and `binary_strided_castout` kernel prevent silent corruption for ops like comparisons (→kBool)
- Adds tunable ILP threshold (default 256K for float) and conformance tests for Metal tensor iterator dispatch
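The effect of rounding a scalar to float16 on the host, as described in the first bullet, can be reproduced without a GPU. The sketch below uses only Python's `struct` module to emulate float16 and float32 rounding. The helper names are hypothetical, and the "keep the scalar at float32, round once on store" path is an assumed stand-in for the corrected behavior, not the PR's actual kernel.

```python
import struct

def round_to(fmt, x):
    """Round a Python float to an IEEE format: 'e' is float16, 'f' is float32."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]

def mul_scalar_rounded_on_host(values, scalar):
    """Scalar is rounded to float16 before the multiply (the problematic path)."""
    s16 = round_to("e", scalar)
    return [round_to("e", v * s16) for v in values]

def mul_scalar_cast_on_store(values, scalar):
    """Scalar stays at float32; the product is rounded to float16 only on store."""
    s32 = round_to("f", scalar)
    return [round_to("e", round_to("f", v * s32)) for v in values]

scalar_as_half = round_to("e", 0.1)                  # 0.0999755859375
drifted = mul_scalar_rounded_on_host([1000.5], 0.1)  # [100.0]
expected = mul_scalar_cast_on_store([1000.5], 0.1)   # [100.0625]
```

The two results differ by one float16 step for this input. Differences of that size, accumulated over many elements, are the kind of error that can show up as loss drift.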
Why It Matters
Fixes critical MPS bugs on Apple Silicon, making training and inference more reliable for PyTorch users on Mac.