trunk/021df83822aa0a40043eaf61bb95f673b714a542: [MPS] Migrate bernoulli to Metal (#182210)
PyTorch's new Metal backend makes Bernoulli up to 10x faster on Apple Silicon
A recent commit by PyTorch maintainer Nikita Shulga (`malfet`) overhauls `bernoulli` random number generation on Apple's MPS (Metal Performance Shaders) backend. The old implementation relied on MPSGraph's slow random number generation, which had become a bottleneck for sampling operations. The new approach leverages PyTorch's existing Philox RNG mechanism, already proven in the CUDA backend, for a substantial performance lift.
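Philox is a counter-based RNG: each output depends only on a (counter, key) pair rather than on mutable shared state, which is why GPU threads can draw independent values in parallel. The following is a minimal pure-Python sketch of Philox4x32-10 and a Bernoulli sampler built on top of it; it is illustrative only, not PyTorch's actual Metal kernel, and the key value is an arbitrary choice for the demo.

```python
# Philox4x32-10: a counter-based RNG. Each call maps a 128-bit counter
# (four 32-bit words) and a 64-bit key to four 32-bit random words with
# no shared state -- the property that makes it GPU-friendly.
M0, M1 = 0xD2511F53, 0xCD9E8D57   # round multipliers
W0, W1 = 0x9E3779B9, 0xBB67AE85   # Weyl key increments
MASK = 0xFFFFFFFF                 # keep words to 32 bits

def philox4x32(counter, key, rounds=10):
    c0, c1, c2, c3 = counter
    k0, k1 = key
    for _ in range(rounds):
        p0, p1 = M0 * c0, M1 * c2  # full 64-bit products
        # mix high/low halves of the products with the key into a new counter
        c0, c1, c2, c3 = (((p1 >> 32) ^ c1 ^ k0) & MASK, p1 & MASK,
                          ((p0 >> 32) ^ c3 ^ k1) & MASK, p0 & MASK)
        k0, k1 = (k0 + W0) & MASK, (k1 + W1) & MASK  # bump the key
    return c0, c1, c2, c3

def bernoulli4(p, counter, key=(0x1234, 0x5678)):
    # threshold four uniform 32-bit words against p -> four 0/1 samples
    return [int(w / 2**32 < p) for w in philox4x32(counter, key)]
```

Because the generator is stateless, drawing sample `i` only requires knowing `i` (encoded in the counter), so a Metal kernel can hand each thread its own counter and never synchronize.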
Benchmarks run on Apple Silicon with a 16-million-element tensor show across-the-board improvements. For example, `torch.bfloat16` hits 67 GB/s of throughput, `torch.float16` reaches 101 GB/s, and even integer types like `int64` see 27 GB/s. The commit also includes rigorous statistical validation, using binomial distribution fits and Q-Q plots, to ensure the output remains correct despite the change in implementation. Developers using PyTorch on M-series Macs will notice faster calls to `torch.bernoulli()` and `tensor.bernoulli_()`, especially when generating large Monte Carlo samples or dropout masks.
- Philox RNG replaces the MPSGraph-based Bernoulli, yielding 10-100x speedups on 16M-element tensors.
- All common dtypes (float32, float16, bfloat16, int8/16/32/64, uint8, bool) benefit, with throughput up to 101 GB/s.
- Statistical validation (binomial PMF, Q-Q plots) confirms sample quality matches CPU implementation.
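The validation idea in the last bullet can be reproduced in a few lines: group Bernoulli draws into fixed-size batches, histogram the per-batch success counts, and compare that histogram against the exact binomial PMF. This is a hedged sketch using the stdlib `random` module as a stand-in sampler; the commit itself validates `torch.bernoulli` output.

```python
import math
import random

def binomial_pmf(n, k, p):
    # exact P(X = k) for X ~ Binomial(n, p)
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

rng = random.Random(0)        # stand-in for the sampler under test
n, p, batches = 20, 0.3, 5000

# success count in each batch of n Bernoulli(p) draws
counts = [sum(rng.random() < p for _ in range(n)) for _ in range(batches)]
empirical = [counts.count(k) / batches for k in range(n + 1)]

# largest gap between observed frequencies and the theoretical PMF
max_err = max(abs(empirical[k] - binomial_pmf(n, k, p)) for k in range(n + 1))
print(f"max |empirical - PMF| = {max_err:.4f}")
```

A correct sampler keeps `max_err` near the sampling noise floor (roughly `sqrt(pmf * (1 - pmf) / batches)` per bin); a biased one shows a systematic gap that shrinks much more slowly as `batches` grows.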
Why It Matters
Faster Bernoulli on Apple Silicon enables quicker dropout, data augmentation, and Monte Carlo simulation in PyTorch.
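As a concrete example of where those faster draws land, a dropout mask is just a Bernoulli sample scaled by the keep probability. Below is a minimal stdlib sketch of inverted dropout; in PyTorch you would use `torch.bernoulli` or `torch.nn.functional.dropout` instead, with the heavy lifting done by the new Metal kernel on MPS.

```python
import random

def dropout_mask(size, p_drop, rng):
    # inverted dropout: each unit survives with probability (1 - p_drop),
    # and survivors are scaled by 1 / (1 - p_drop) so the expected
    # activation is unchanged at training time
    keep = 1.0 - p_drop
    return [(1.0 / keep) if rng.random() < keep else 0.0
            for _ in range(size)]

rng = random.Random(42)
mask = dropout_mask(10_000, 0.2, rng)  # ~20% zeros, survivors scaled to 1.25
```

Multiplying activations elementwise by such a mask is exactly the pattern that benefits from fast Bernoulli sampling: the mask is regenerated on every training step.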