Research & Papers

ParamSpMM: 1.92x faster GNN training with adaptive GPU SpMM

New parametric approach averages a 1.92x speedup over NVIDIA cuSPARSE.

Deep Dive

Sparse Matrix-Matrix Multiplication (SpMM) is a critical bottleneck in Graph Neural Network (GNN) training due to wildly varying input sparsity patterns. Existing GPU implementations, including NVIDIA cuSPARSE, fail to adapt across different graphs, leading to suboptimal performance. In a new paper accepted at GDMA 2025, Lixing Zhang and colleagues analyze these limitations and propose ParamSpMM, a parametric framework that dynamically adjusts optimization strategies per input.
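To make the bottleneck concrete, here is a minimal pure-Python sketch of SpMM over a standard CSR matrix. It is a reference loop, not the paper's kernel or cuSPARSE's. The function name `spmm_csr` and the toy data are assumptions made for illustration.

```python
def spmm_csr(indptr, indices, data, dense, n_cols):
    """Multiply a CSR sparse matrix by a row-major dense matrix."""
    n_rows = len(indptr) - 1
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for row in range(n_rows):
        # Nonzeros of this row occupy data[indptr[row]:indptr[row + 1]].
        for k in range(indptr[row], indptr[row + 1]):
            col, val = indices[k], data[k]
            for j in range(n_cols):
                out[row][j] += val * dense[col][j]
    return out


# Toy 3x3 sparse matrix: row 0 has 3 nonzeros, row 1 has none, row 2 has 1.
indptr = [0, 3, 3, 4]
indices = [0, 1, 2, 1]
data = [1.0, 2.0, 3.0, 4.0]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 nodes, 2 features each

result = spmm_csr(indptr, indices, data, features, n_cols=2)
row_lengths = [indptr[i + 1] - indptr[i] for i in range(len(indptr) - 1)]
print(result)       # [[4.0, 5.0], [0.0, 0.0], [0.0, 4.0]]
print(row_lengths)  # [3, 0, 1]
```

The inner-loop work per row equals that row's nonzero count, so uneven row lengths like `[3, 0, 1]` translate into uneven work per thread or warp when rows are mapped to GPU threads naively. That imbalance is one reason a single fixed strategy may not suit every graph.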

ParamSpMM introduces a Parameterized Compressed Sparse Row (PCSR) format that can toggle between tiling, row-reordering, and warp-level techniques. It pairs this with a lightweight ML-based SpMM-decider that predicts the optimal combination without runtime overhead. In benchmarks across diverse GNN workloads, ParamSpMM delivers an average 1.92x speedup over cuSPARSE, making it a drop-in upgrade for PyTorch Geometric and similar frameworks. The work underscores the importance of adaptive kernels for production GNN systems.
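The per-input selection idea can be sketched as follows. This is a toy stand-in, not the paper's SpMM-decider: the real decider is a learned model, while this version uses a hand-written rule. The `SpmmConfig` fields, the feature names, and every threshold are hypothetical and chosen only to illustrate mapping sparsity features to a configuration.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass(frozen=True)
class SpmmConfig:
    # Hypothetical knobs standing in for the techniques named above.
    use_tiling: bool
    reorder_rows: bool
    rows_per_warp: int


def row_length_features(indptr):
    """Summarize the sparsity pattern from CSR row pointers."""
    lengths = [indptr[i + 1] - indptr[i] for i in range(len(indptr) - 1)]
    avg = mean(lengths)
    # Coefficient of variation: high values mean skewed row lengths.
    cv = pstdev(lengths) / avg if avg > 0 else 0.0
    return {"avg_row_len": avg, "row_len_cv": cv}


def choose_config(features, dense_cols):
    """Hand-written stand-in for a learned decider; thresholds are made up."""
    is_skewed = features["row_len_cv"] > 1.0
    short_rows = features["avg_row_len"] < 4
    return SpmmConfig(
        use_tiling=dense_cols >= 64,
        reorder_rows=is_skewed,
        rows_per_warp=4 if short_rows else 1,
    )


uniform_graph = [0, 8, 16, 24, 32]  # every row has 8 nonzeros
hub_graph = [0, 1, 2, 3, 35]        # one hub row with 32 nonzeros

cfg_uniform = choose_config(row_length_features(uniform_graph), dense_cols=128)
cfg_hub = choose_config(row_length_features(hub_graph), dense_cols=16)
print(cfg_uniform)
print(cfg_hub)
```

With these made-up rules, the uniform graph gets tiling but no reordering, while the hub-dominated graph gets row reordering. A learned decider would replace `choose_config` with a trained model over similar cheap features, so the choice costs little relative to the multiplication itself.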

Key Points
  • ParamSpMM uses a new PCSR data structure to flexibly integrate multiple SpMM optimization techniques.
  • An ML-based 'SpMM-decider' predicts the best configuration per input, avoiding one-size-fits-all heuristics.
  • Achieves an average 1.92x speedup over NVIDIA cuSPARSE on diverse GNN workloads.

Why It Matters

Faster SpMM directly accelerates GNN training on GPUs, making larger graphs and more training iterations practical within the same compute budget.