Research & Papers

PackSELL: A Sparse Matrix Format for Precision-Agnostic High-Performance SpMV

A new sparse matrix format matches FP32 accuracy while outperforming NVIDIA's FP16 cuSPARSE library, with SpMV speedups of up to 1.63x in half precision.

Deep Dive

Researchers Kengo Suzuki and Takeshi Iwashita have introduced PackSELL, a novel sparse matrix storage format designed to maximize performance on GPUs while supporting flexible, precision-agnostic data representations. Building on the existing Sliced ELLPACK (SELL) format, PackSELL's key innovation is a packing scheme that combines delta-encoded column indices with their corresponding values into single machine words. This drastically reduces the memory footprint and data movement during the critical sparse matrix-vector multiplication (SpMV) operation, a bottleneck in scientific computing and AI. The design grants fine-grained control over bit allocation, allowing developers to use non-standard, customized numerical formats tailored to their specific accuracy and performance needs.
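The packing idea can be illustrated with a small sketch. The paper's exact bit layout is not reproduced here; this hypothetical Python version assumes each 32-bit word holds a 16-bit column-index delta in its high bits and a raw FP16 value in its low bits:

```python
import numpy as np

def pack_row(col_indices, values):
    """Pack one sparse row into uint32 words (hypothetical layout):
    high 16 bits = delta from the previous column index,
    low 16 bits  = raw FP16 bit pattern of the value."""
    deltas = np.diff(col_indices, prepend=0).astype(np.uint32)
    bits = np.asarray(values, dtype=np.float16).view(np.uint16).astype(np.uint32)
    return (deltas << 16) | bits

def unpack_row(packed):
    """Recover absolute column indices and FP16 values from packed words."""
    cols = np.cumsum((packed >> 16).astype(np.int64))
    vals = (packed & 0xFFFF).astype(np.uint16).view(np.float16)
    return cols, vals
```

Because index and value travel in one word, an SpMV kernel issues a single load per nonzero instead of two, which is the data-movement saving the format targets.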

Experimental results demonstrate significant performance gains. When configured for standard half-precision (FP16), PackSELL-based SpMV kernels outperformed NVIDIA's optimized cuSPARSE library by up to 1.63x. More impressively, when using custom bit allocations, PackSELL achieved the accuracy of full 32-bit precision (FP32) while still exceeding the raw speed of FP16 cuSPARSE. This breakthrough extends to complete algorithms; a mixed-precision Preconditioned Conjugate Gradient (PCG) solver leveraging PackSELL achieved a 2.09x speedup over a standard full-precision solver without sacrificing result quality. This makes it highly relevant for large-scale simulations in fields like computational fluid dynamics and the training of massive neural networks.
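The mixed-precision idea can be sketched in a few lines. This is not the authors' solver: it is a simplified Jacobi-preconditioned CG in which the matrix is stored in low precision (standing in for a compact PackSELL encoding) while the solver's vectors and reductions stay in FP64:

```python
import numpy as np

def mixed_precision_pcg(A, b, tol=1e-8, max_iter=200):
    """Jacobi-preconditioned CG with a low-precision matrix (illustrative).

    The SpMV operand is rounded to FP16 to mimic compact storage; all
    vectors, dot products, and updates remain in full FP64 precision.
    """
    A16 = A.astype(np.float16).astype(np.float64)  # simulate FP16 storage
    M_inv = 1.0 / np.diag(A)                       # Jacobi preconditioner
    x = np.zeros_like(b)
    r = b - A16 @ x
    z = M_inv * r
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Ap = A16 @ p                 # low-precision matrix, FP64 accumulate
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            break
        z = M_inv * r
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x
```

The speedup in the paper comes from moving fewer bytes per nonzero during the SpMV, which dominates PCG's runtime; the full-precision vector arithmetic preserves convergence quality.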

Key Points
  • PackSELL packs delta-encoded column indices and their corresponding values into single machine words, cutting memory use and data movement for SpMV on GPUs.
  • Enables custom numerical formats via fine-grained bit control, achieving FP32 accuracy while beating FP16 cuSPARSE performance.
  • Accelerates real-world solvers, delivering a 2.09x speedup for a mixed-precision PCG algorithm over a standard FP32 implementation.
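The accuracy trade-off behind the bit-allocation point can be mimicked directly. This hypothetical quantizer truncates FP32 values to a chosen number of leading bits, as if the freed low bits of each word were given to a packed index field (the paper's actual allocations are not reproduced here):

```python
import numpy as np

def truncate_value_bits(x, value_bits=24):
    """Keep only the top `value_bits` bits of each FP32 value, zeroing the
    rest -- an illustrative stand-in for a custom packed value format,
    not PackSELL's actual encoding."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    mask = np.uint32((0xFFFFFFFF << (32 - value_bits)) & 0xFFFFFFFF)
    return (bits & mask).view(np.float32)
```

With 24 value bits the word retains the sign, the full 8-bit exponent, and 15 mantissa bits, five more than FP16's 10, so for in-range values the truncation error is smaller than FP16 rounding error while each entry still occupies less space than a separate FP32 value plus a 32-bit index.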

Why It Matters

Dramatically accelerates large-scale scientific simulations and AI model training by making sparse linear algebra on GPUs far more efficient.