RSH-SpMM: A Row-Structured Hybrid Kernel for Sparse Matrix-Matrix Multiplication on GPUs
New GPU kernel tackles inefficient sparse matrix math, a major bottleneck in AI and scientific computing.
A research team has introduced RSH-SpMM, a novel GPU kernel designed to solve a persistent performance bottleneck: Sparse Matrix-Matrix Multiplication (SpMM). SpMM is a core operation in fields like graph analytics, scientific computing, and the training of sparse AI models, but its efficiency is crippled by the irregular, non-uniform sparsity patterns of real-world data. Existing GPU methods struggle with this irregularity, failing to keep powerful Tensor Cores utilized and delivering unstable performance across inputs. RSH-SpMM attacks this problem directly with a fine-grained, row-structured hybrid framework.
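For context, the sketch below is a minimal, generic CUDA SpMM kernel, assuming the common sparse-times-dense formulation with the sparse matrix in CSR format. It is an illustration of the baseline operation, not the paper's kernel, and it makes the irregularity problem concrete: every row runs the same code regardless of how many nonzeros it holds.

```cuda
// Minimal reference SpMM: C = A * B, where A is m x k sparse (CSR),
// B is k x n dense, C is m x n dense, both row-major.
// Generic illustration only -- this is NOT the RSH-SpMM kernel.
__global__ void spmm_csr_naive(int m, int n,
                               const int*   __restrict__ rowPtr,  // m + 1 entries
                               const int*   __restrict__ colIdx,  // nnz entries
                               const float* __restrict__ vals,    // nnz entries
                               const float* __restrict__ B,
                               float*       __restrict__ C)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= m || col >= n) return;

    float acc = 0.0f;
    // The loop length is the row's nonzero count: long rows keep threads
    // busy while short rows idle, producing the load imbalance and poor
    // Tensor Core utilization the article describes.
    for (int j = rowPtr[row]; j < rowPtr[row + 1]; ++j)
        acc += vals[j] * B[colIdx[j] * n + col];
    C[row * n + col] = acc;
}
```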
The kernel's innovation lies in its two-pronged approach. First, it uses adaptive row partitioning and a new data representation called RS-Tile to identify and isolate dense, regular fragments within the sparse matrix that can be processed at peak speed on GPU Tensor Cores. Second, it routes the remaining highly irregular rows to a streamlined, low-overhead CUDA execution path. The hybrid strategy is further tuned with load balancing and locality-aware data reordering. Benchmarks show it consistently outperforms existing SpMM designs, delivering speedups of 1.27x to 6.13x while maintaining robust performance across diverse and challenging sparse workloads.
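To make the routing idea concrete, here is a hedged host-side sketch of the two-path dispatch. The nonzero-count threshold, the helper and kernel names, and the per-row classification rule are all hypothetical stand-ins; the paper's actual RS-Tile construction, load balancing, and Tensor Core pipeline are not reproduced here.

```cuda
#include <vector>

// Hypothetical sketch: split rows into a "regular" set (packable into dense
// tiles for Tensor Cores) and an "irregular" set (handled by a lean CUDA
// path). The threshold and names are assumptions, not from the paper.
void classify_rows(int m, const int* h_rowPtr,
                   std::vector<int>& denseRows,     // candidates for RS-Tiles
                   std::vector<int>& irregularRows) // scalar CUDA path
{
    const int DENSE_NNZ_THRESHOLD = 32;  // assumed cutoff for illustration
    for (int r = 0; r < m; ++r) {
        int nnz = h_rowPtr[r + 1] - h_rowPtr[r];
        (nnz >= DENSE_NNZ_THRESHOLD ? denseRows : irregularRows).push_back(r);
    }
    // A real implementation would then pack denseRows into tile fragments
    // and launch two kernels, e.g. (names hypothetical):
    //   tensor_core_path<<<gridA, blockA>>>(...)  // mma-based, over RS-Tiles
    //   cuda_scalar_path<<<gridB, blockB>>>(...)  // over irregularRows
}
```

Classifying at row granularity, rather than treating the whole matrix as one workload, is what lets the regular and irregular regions of the same matrix each take their best execution path.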
- Achieves 1.27x to 6.13x speedup over current state-of-the-art SpMM methods on GPUs.
- Uses a hybrid kernel that routes regular data to Tensor Cores and irregular data to a lean CUDA path.
- Targets a fundamental bottleneck in graph analytics, scientific simulation, and sparse deep learning training.
Why It Matters
Faster SpMM accelerates the training of massive sparse AI models and large-scale scientific computations, reducing cost and time.