Beating vDSP: A 138 GFLOPS Radix-8 Stockham FFT on Apple Silicon via Two-Tier Register-Threadgroup Memory Decomposition
A new open-source FFT implementation runs about 29% faster than vDSP, Apple's own optimized signal-processing library.
Independent researcher Mohamed Amine Bergach has published a paper detailing a highly optimized Fast Fourier Transform (FFT) implementation that outperforms Apple's own vDSP library on the company's M-series silicon. The key innovation is a formal 'two-tier local memory model' that rethinks how data is staged on the GPU. Instead of staging data in slower threadgroup memory (the Metal counterpart of what other GPU APIs call shared memory), the kernel keeps its working set in the large 208 KiB register file and uses the 32 KiB threadgroup memory only to exchange values between threads. This approach, inspired by prior work on Intel integrated GPUs, cuts the memory-access costs that would otherwise bottleneck the transform.
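To make the two capacities concrete, here is a back-of-the-envelope sketch of how much space the supported transform sizes need. It assumes 8 bytes per single-precision complex sample and treats both capacities as fully available to one transform, which is a simplification: real kernels also need registers for twiddle factors, indices, and temporaries. The function names are illustrative and do not come from the paper.

```python
KIB = 1024
REGISTER_FILE_BYTES = 208 * KIB   # register file size quoted in the article
THREADGROUP_BYTES = 32 * KIB      # threadgroup memory size quoted in the article
BYTES_PER_SAMPLE = 2 * 4          # one complex value = two float32 components


def working_set_bytes(n):
    """Bytes needed to hold n single-precision complex samples."""
    return n * BYTES_PER_SAMPLE


def budget(n):
    """Compare the working set of an n-point FFT with both memory tiers."""
    size = working_set_bytes(n)
    return {
        "n": n,
        "kib": size / KIB,
        "fits_registers": size <= REGISTER_FILE_BYTES,
        "fits_threadgroup": size <= THREADGROUP_BYTES,
        "threadgroup_fraction": size / THREADGROUP_BYTES,
    }


if __name__ == "__main__":
    for n in (256, 4096, 16384):
        print(budget(n))
```

Under these assumptions a 4096-point transform alone fills the entire 32 KiB of threadgroup memory, while a 16384-point transform (128 KiB) exceeds it but is still smaller than the register file. That is consistent with treating registers as the primary home for the data.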
Implemented in Metal Shading Language, the radix-8 Stockham kernel achieves 138.45 GFLOPS for a 4096-point single-precision complex FFT, against a vDSP baseline of 107 GFLOPS, a speedup of about 29% (138.45 / 107 ≈ 1.29). The research also yielded counter-intuitive hardware insights: threadgroup memory barriers on Apple GPUs are surprisingly cheap (~2 cycles), while scattered access patterns are the real performance killer. The implementation supports transform sizes from N=256 to N=16384 and is validated against the outputs of Apple's library.
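For readers unfamiliar with the Stockham formulation, the following is a minimal CPU-side sketch of a radix-8 Stockham autosort FFT in plain Python. It is not the paper's Metal kernel and makes no attempt to model registers or threadgroup memory. It only illustrates the access pattern: each pass reads from one buffer and writes to another in an order that leaves the final result naturally ordered, so no bit-reversal step is needed. The function names are illustrative, and the length must be a power of the radix (4096 = 8^4 qualifies).

```python
import cmath


def stockham_fft(x, radix=8):
    """Forward DFT of x via a Stockham autosort FFT (len(x) must be radix**k)."""
    n = len(x)
    check = n
    while check > 1:
        if check % radix != 0:
            raise ValueError("length must be a power of the radix")
        check //= radix

    src = [complex(v) for v in x]
    dst = [0j] * n
    # Roots of unity used by the small radix-point butterfly.
    w_r = [cmath.exp(-2j * cmath.pi * t / radix) for t in range(radix)]

    length, stride = n, 1
    while length > 1:
        m = length // radix
        for p in range(m):
            for q in range(stride):
                # Gather the butterfly inputs, spaced m * stride apart.
                a = [src[q + stride * (p + k * m)] for k in range(radix)]
                for j in range(radix):
                    acc = sum(a[k] * w_r[(j * k) % radix] for k in range(radix))
                    twiddle = cmath.exp(-2j * cmath.pi * j * p / length)
                    # Scatter so the next pass sees contiguous sub-sequences.
                    dst[q + stride * (radix * p + j)] = acc * twiddle
        src, dst = dst, src          # ping-pong the two buffers
        length, stride = m, stride * radix
    return src


def naive_dft(x):
    """O(n^2) reference DFT used only to check the fast version."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
        for f in range(n)
    ]


if __name__ == "__main__":
    signal = [complex(i % 5, -(i % 3)) for i in range(64)]
    fast, ref = stockham_fft(signal), naive_dft(signal)
    print(max(abs(a - b) for a, b in zip(fast, ref)))  # small rounding error
```

Comparing against a slow reference transform mirrors, in spirit, the validation the article describes. A 4096-point input would take four radix-8 passes through the outer loop.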
This work is significant because the FFT is a foundational algorithm used everywhere from AI workloads (for FFT-based convolution) and scientific simulation to audio processing and computer vision. A faster, open-source FFT implementation gives developers and researchers a performance-critical tool that can accelerate a wide range of applications on the ubiquitous Apple Silicon platform, potentially reducing compute time and energy consumption for intensive workloads.
- Achieves 138.45 GFLOPS for a 4096-point complex FFT, a 29% speedup over Apple's vDSP.
- Uses a novel 'two-tier' memory model prioritizing the GPU's 208 KiB register file over threadgroup memory.
- Provides open-source Metal kernels validated against Apple's library, ready for use in AI and scientific computing.
Why It Matters
Faster FFTs accelerate core computations in AI, scientific simulation, and media processing on millions of Apple devices.