PyTorch accelerates randperm on Apple Silicon with new Metal backend

PyTorch Releases May 31, 2026

⚡Up to 1.57x faster random permutations on M4 Max, replacing MPSGraph.

Deep Dive

PyTorch has moved the randperm operation on Apple Silicon (the MPS backend) from MPSGraph to native Metal kernels, with speedups at most tensor sizes. The new implementation selects one of three algorithms based on n: a single-threadgroup Fisher-Yates shuffle for n ≤ 384, a uniform-key argsort for 384 < n < 32768, and a partial radix sort with tie-dedup for n ≥ 32768. The change removes the old MPSGraph path and its caching machinery and relies on Metal compute kernels directly to cut latency.
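The size-based selection can be sketched on the CPU as follows. This is a minimal Python illustration under stated assumptions, not the Metal kernels themselves: only the two thresholds (384 and 32768) come from the article, the function names (`choose_path`, `fisher_yates`, `key_argsort`, `randperm_sketch`) are hypothetical, and the radix-sort path is stood in for by an ordinary key sort.

```python
import random

# Thresholds as reported in the article; everything else is illustrative.
SMALL_MAX = 384      # n <= 384: Fisher-Yates
LARGE_MIN = 32768    # n >= 32768: radix-sort path in the real backend


def choose_path(n):
    """Return the name of the path the article describes for a given n."""
    if n <= SMALL_MAX:
        return "fisher_yates"
    if n < LARGE_MIN:
        return "argsort"
    return "radix_sort"


def fisher_yates(n, rng):
    """Sequential Fisher-Yates shuffle of 0..n-1."""
    perm = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i + 1)  # uniform index in [0, i]
        perm[i], perm[j] = perm[j], perm[i]
    return perm


def key_argsort(n, rng):
    """Draw one random key per index and return the argsort of the keys.

    Tied keys would bias the result slightly; with float keys ties are
    vanishingly rare, so this sketch does not handle them.
    """
    keys = [rng.random() for _ in range(n)]
    return sorted(range(n), key=keys.__getitem__)


def randperm_sketch(n, seed=None):
    """CPU stand-in: Fisher-Yates for small n, key sort otherwise."""
    rng = random.Random(seed)
    if choose_path(n) == "fisher_yates":
        return fisher_yates(n, rng)
    # Assumption: the radix path is approximated by the same key-sort idea.
    return key_argsort(n, rng)
```

Both helpers return a valid permutation of 0..n-1; the split only changes how the random order is produced, which is where a GPU implementation would gain or lose time.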

Benchmarks on an M4 Max show a 1.46x speedup at n=10 and 1.57x at n=100, with gains of 1.1x to 1.35x at larger sizes up to 10M. The one exception is n=1M, where the new path runs at 0.77x the speed of the fused MPSGraph implementation; the team considers this regression acceptable. Uniformity was checked with a full-distribution chi-square test (n=7, all 5040 permutations) and an adjacent-ascending-pair test at n=1M, and neither showed statistical bias (z within ±1 of the uniform expectation). The change was authored with Claude and reviewed by Skylion007.
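The two uniformity checks can be approximated in a few lines of Python. This is a sketch of one plausible reading, not the test code from the change: the helper names are hypothetical, the chi-square statistic is reduced to a z-score with a normal approximation, and the adjacent-pair check is assumed to count ascents (positions where p[i] < p[i+1]), which for a uniform permutation of length n have mean (n-1)/2 and variance (n+1)/12.

```python
import math
import random
from collections import Counter

_rng = random.Random(0)  # fixed seed so the reference sampler is repeatable


def reference_sampler(n):
    """CPU baseline: a uniform permutation from the standard library."""
    perm = list(range(n))
    _rng.shuffle(perm)
    return perm


def chi_square_z(sampler, n=7, draws=100_000):
    """Chi-square over all n! permutations, returned as an approximate z-score."""
    counts = Counter(tuple(sampler(n)) for _ in range(draws))
    cells = math.factorial(n)  # 5040 when n = 7
    expected = draws / cells
    chi2 = sum((c - expected) ** 2 / expected for c in counts.values())
    # Permutations never observed each contribute (0 - expected)^2 / expected.
    chi2 += (cells - len(counts)) * expected
    dof = cells - 1
    return (chi2 - dof) / math.sqrt(2 * dof)


def ascent_z(perm):
    """z-score of the number of adjacent ascending pairs in one permutation."""
    n = len(perm)
    ascents = sum(perm[i] < perm[i + 1] for i in range(n - 1))
    mean = (n - 1) / 2
    var = (n + 1) / 12
    return (ascents - mean) / math.sqrt(var)
```

A sampler passes when both scores stay near zero; a biased sampler, such as one that always returns the identity permutation, pushes both far above any reasonable threshold.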

Key Points

Three optimized paths: Fisher-Yates for n≤384, argsort for 384<n<32768, radix sort for n≥32768.
Up to 1.57x faster for small n and 1.35x faster at n=10M; one regression at n=1M, at 0.77x the speed of MPSGraph.
Uniformity verified via chi-square and adjacent-pair tests, matching CPU and CUDA accuracy.
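To read these ratios as runtimes, the small helper below converts a speedup into the new path's runtime relative to the old one. It assumes the figures are defined as old time divided by new time, which the article does not state explicitly; `relative_runtime` is a hypothetical name.

```python
def relative_runtime(speedup):
    """New runtime as a multiple of the old one, assuming speedup = old / new."""
    return 1.0 / speedup


for label, s in [("n=100", 1.57), ("n=1M", 0.77), ("n=10M", 1.35)]:
    print(f"{label}: {relative_runtime(s):.2f}x the MPSGraph runtime")
```

Under that assumption, 0.77x corresponds to the new path taking about 1.30 times as long as MPSGraph at n=1M, while 1.57x corresponds to about 0.64 times as long at n=100.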

Why It Matters

Faster random permutations on Apple Silicon benefit ML training workflows and scientific computing on Mac.
