High-Performance Portable GPU Primitives for Arbitrary Types and Operators in Julia
A Julia library achieves vendor-level GPU performance while maintaining cross-platform portability across NVIDIA and AMD hardware.
Researcher Emmanuel Pilliat has developed GPUify.jl, a new Julia library that addresses a critical problem in high-performance computing: achieving vendor-optimized GPU performance while maintaining cross-platform portability. The library implements fundamental parallel primitives, including scan, mapreduce, and matrix-vector operations, through a two-layer architecture. The lower layer provides backend-agnostic abstractions for hardware-specific operations such as warp-level shuffles and memory fences, while the upper layer builds high-performance algorithms exclusively on these portable interfaces.
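To see what "arbitrary types and operators" means for primitives like scan and mapreduce, here is a CPU-side sketch using only Julia's Base functions (it does not use GPUify.jl's own API, which is not shown in this article): a scan is an accumulation with any associative operator, and mapreduce fuses a per-element transform with a reduction.

```julia
# Inclusive scan with an arbitrary associative operator: here `min`,
# giving a running minimum instead of the usual running sum.
xs = [5, 3, 8, 1, 4]
running_min = accumulate(min, xs)      # [5, 3, 3, 1, 1]

# The same primitive works for arbitrary element types, e.g. tuples
# combined component-wise; the operator just has to be associative.
pairs = [(1, 10), (2, 20), (3, 30)]
combine((a, b), (c, d)) = (a + c, max(b, d))
scanned = accumulate(combine, pairs)   # [(1, 10), (3, 20), (6, 30)]

# mapreduce fuses a per-element transform with a reduction:
# sum of squares without materializing an intermediate array.
sq_sum = mapreduce(x -> x^2, +, xs)    # 25 + 9 + 64 + 1 + 16 = 115
```

Generic GPU primitives must deliver this same flexibility (any element type, any associative operator) at the speed of hand-tuned kernels that are typically specialized for numeric types and built-in operators.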
In benchmarks on an NVIDIA A40 and an AMD MI300X GPU, GPUify.jl matched or outperformed NVIDIA's highly optimized CUB library on scan and mapreduce operations on the A40 hardware. Perhaps more impressively, it achieved cuBLAS-level throughput on matrix-vector operations across most tested configurations. This demonstrates that portable, just-in-time compiled abstractions can indeed reach performance levels previously reserved for vendor-specific, hand-tuned libraries.
The significance lies in the proof-of-concept that GPUify.jl provides: portable code doesn't have to mean compromised performance. By building on Julia's JIT compilation capabilities and creating smart abstractions that map efficiently to different GPU architectures, Pilliat has shown that developers can write GPU-accelerated code once and deploy it across NVIDIA, AMD, and potentially other hardware without sacrificing the raw speed needed for scientific computing and AI workloads. This could dramatically reduce the development burden for teams targeting multiple GPU platforms.
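The "write once, deploy anywhere" pattern rests on Julia's generic dispatch: code written against `AbstractArray` is JIT-compiled into specialized native or GPU code for whatever concrete array type it receives. A minimal CPU-side sketch of that pattern follows; the GPU array types mentioned in the comments (`CuArray` from CUDA.jl, `ROCArray` from AMDGPU.jl) are real but not loaded here, so this runs only on plain arrays.

```julia
# A generic function written once against AbstractArray. Passed an
# Array it compiles to CPU code; passed a CuArray (CUDA.jl) or a
# ROCArray (AMDGPU.jl), the same source would compile for the
# corresponding GPU backend. Neither GPU package is loaded here.
function saxpy!(y::AbstractArray, a::Number, x::AbstractArray)
    y .= a .* x .+ y   # broadcast fuses into a single kernel per backend
    return y
end

y = Float32[1, 2, 3]
x = Float32[10, 20, 30]
saxpy!(y, 2.0f0, x)    # y is now [21.0f0, 42.0f0, 63.0f0]
```

This dispatch-plus-JIT mechanism is what lets a single Julia codebase target multiple vendors; the open question GPUify.jl addresses is whether such generic code can also match vendor-tuned kernel performance.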
- GPUify.jl matches or beats NVIDIA CUB performance on scan and mapreduce operations on A40 GPUs
- Achieves cuBLAS-level throughput on matrix-vector operations across most tested configurations
- Two-layer architecture provides portable abstractions without performance penalties
Why It Matters
Enables developers to write GPU code once that runs at native speed across NVIDIA and AMD hardware, reducing multi-platform development costs.