Developer Tools

PyTorch #183055 fixes MPS binary tensor ops with precision and dispatch improvements

PyTorch Releases May 11, 2026

⚡ New Metal backend patch resolves float16 drift and broken cosine_similarity, and adds 10+ kernel fixes

Deep Dive

PyTorch has merged PR #183055, a significant patch to its Metal Performance Shaders (MPS) backend authored with assistance from Claude, which resolves longstanding issues in binary tensor operations on Apple Silicon GPUs. The fix addresses two core problems. First, scalar promotion was incorrect: float16 scalars were being rounded on the host, which caused CrossEntropyLoss drift. Second, mismatched kernel names led to the wrong kernels being selected (e.g., `mul_dense_scalar_float2_float` for complex convolution). A new `natural_output_dtype` parameter on `exec_binary_kernel` lets ops whose results have their own type, such as comparisons (which produce `kBool`), declare that output type explicitly. Those ops are routed through a new `binary_strided_castout` kernel that computes at compile-time precision and casts on store. The contract is already used by the comparison-ops migration in #183019, which fixes `cdist` mps_float32 (`linalg_vector_norm(p=0)` → `ne_outf`) and inductor's `test_complex_from_real_imag_mps`.
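To make the two ideas in this paragraph concrete, here is a toy sketch in plain Python. It is not the PR's Metal or C++ code: `to_f16` and `binary_castout` are hypothetical helpers written for this illustration, Python's double-precision floats stand in for the kernel's wider compute type, and the `struct` module's `e` format stands in for a float16 store. The sketch only shows why rounding a scalar to float16 before the computation can change the stored result, and what "compute in the natural type, cast on store" looks like for a comparison.

```python
import operator
import struct


def to_f16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def binary_castout(op, lhs, rhs, store_cast):
    """Hypothetical helper: compute op in its natural type, cast once on store."""
    return [store_cast(op(a, b)) for a, b in zip(lhs, rhs)]


# Scalar precision: multiply a float16 element (7.0) by the scalar 0.1.
x = 7.0
host_rounded = to_f16(x * to_f16(0.1))  # scalar squeezed into float16 first
cast_on_store = to_f16(x * 0.1)         # scalar kept wide, one rounding at the end

# Natural output dtype: a comparison naturally yields booleans ...
as_bool = binary_castout(operator.ge, [1.0, 2.0], [2.0, 2.0], bool)
# ... and only the final store decides how they are represented.
as_float = binary_castout(operator.ge, [1.0, 2.0], [2.0, 2.0], float)

print(host_rounded, cast_on_store)
print(as_bool, as_float)
```

In this toy case the two float16 results differ by one unit in the last place, and the value that was rounded only on store is the one closer to 0.7. Errors of this size are the kind that can accumulate across a loss computation.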

The patch also introduces a caller-tunable ILP (instruction-level parallelism) threshold, which defaults to 256K for float outputs and is off for non-float outputs. The unroll width is visible in host-side kernel names (`add_dense_ilp4_float_float`) so that `ilp8`/`ilp16` variants can coexist in the future. Missing kernel combinations now fail loudly via `getPipelineStateForFunc` instead of silently corrupting results. A new `TestBinaryIteratorConformance` test suite compiles synthetic functors (`simple_add`, `simple_ge`) through the same MetalShaderLibrary path, and `test_binary_kernels` regressions cover scalar precision and in-place narrowing. The X-macro infrastructure (`C10_METAL_ALL_TYPES_FUNCTOR`) is extended to generate `val_at_offs` and `store_at_offs` automatically. This is a foundational fix: it unblocks further MPS ops and turns dispatch mistakes that previously corrupted results silently into explicit errors.
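The naming and lookup behavior described here can be sketched roughly as follows. This is an illustrative Python model, not PyTorch's implementation: `kernel_name`, `lookup`, and `ILP_THRESHOLD_FLOAT` are hypothetical names, and the sketch assumes that "256K" means 256 × 1024 elements, that the ILP variant is chosen when the element count reaches the threshold, and that the non-ILP variant simply drops the `ilp4` token. Only the `add_dense_ilp4_float_float` name itself comes from the article.

```python
# Assumption: "256K" is an element count of 256 * 1024.
ILP_THRESHOLD_FLOAT = 256 * 1024


def kernel_name(op, dtypes, numel, ilp_width=4, threshold=ILP_THRESHOLD_FLOAT):
    """Build a hypothetical host-side kernel name that encodes the unroll width.

    A threshold of None models "ILP off", as for non-float outputs.
    """
    use_ilp = threshold is not None and numel >= threshold
    variant = f"dense_ilp{ilp_width}" if use_ilp else "dense"
    return "_".join([op, variant, *dtypes])


def lookup(registry, name):
    """Fail loudly on a missing kernel instead of falling back to another one."""
    try:
        return registry[name]
    except KeyError:
        raise RuntimeError(f"no Metal kernel registered as '{name}'") from None


# A toy registry standing in for the compiled shader library.
registry = {"add_dense_ilp4_float_float": "pipeline-state-placeholder"}

large = kernel_name("add", ("float", "float"), numel=1 << 20)
small = kernel_name("add", ("float", "float"), numel=16)
print(large, small)
print(lookup(registry, large))
```

Because the unroll width is part of the name, a later `ilp8` variant would register under a different key rather than replacing the `ilp4` entry, and a name with no registered kernel raises an error instead of dispatching to the wrong one.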

Key Points

Fixes scalar promotion: restricts promotion to lossless int→float conversions and stops rounding float16 scalars on the host (fixes CrossEntropyLoss drift)
New `natural_output_dtype` contract and `binary_strided_castout` kernel prevent silent corruption for ops like comparisons (→kBool)
Adds tunable ILP threshold (default 256K for float) and conformance tests for Metal tensor iterator dispatch

Why It Matters

Fixes critical MPS bugs on Apple Silicon, enabling reliable training/inference for PyTorch users on Mac.
