PRISM uses Processing-In-Memory to accelerate tensor decomposition by up to 2.64x
New PIM method speeds up sparse tensor decomposition by up to 2.64x on UPMEM hardware.
Sparse tensor decomposition is fundamental to many machine learning pipelines, but its core operation, spMTTKRP (Sparse Matricized Tensor Times Khatri-Rao Product), is notoriously memory-bound, which limits performance on conventional processors. PRISM, presented by researchers from Universidade de Lisboa, is the first work to tackle this bottleneck with Processing-In-Memory (PIM) technology, specifically UPMEM's distributed memory system. The approach explores partitioning strategies, number formats (e.g., mixed precision), and kernel optimizations, and adds a heterogeneous collaboration mode that splits work between PIM and CPU cores.
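To make the operation concrete, here is a minimal sketch of mode-0 spMTTKRP for a 3-way sparse tensor stored as a coordinate list, written in plain Python. It is an illustration of the general operation, not PRISM's kernel: the function names, the COO layout, and the `pim_share` parameter are assumptions introduced here. The second function shows why the nonzeros can be divided between two devices at all: the operation is linear in the nonzeros, so partial results computed on separate chunks can simply be added.

```python
def spmttkrp_mode0(nnz, B, C, num_rows):
    """Mode-0 spMTTKRP for a 3-way sparse tensor in COO form.

    nnz: list of (i, j, k, value) nonzeros
    B, C: dense factor matrices as lists of rows, each of length R
    Returns M with M[i][r] = sum over nonzeros of value * B[j][r] * C[k][r].
    """
    rank = len(B[0])
    M = [[0.0] * rank for _ in range(num_rows)]
    for i, j, k, value in nnz:
        # Rows of B and C are gathered at data-dependent indices; this
        # irregular access is what makes the operation memory-bound.
        b_row, c_row = B[j], C[k]
        for r in range(rank):
            M[i][r] += value * b_row[r] * c_row[r]
    return M


def split_and_merge(nnz, B, C, num_rows, pim_share):
    """Hypothetical two-way split: process a share of the nonzeros in one
    call and the rest in another, then add the partial outputs.
    This is NOT PRISM's partitioning scheme, only a linearity demo."""
    cut = int(len(nnz) * pim_share)
    part_a = spmttkrp_mode0(nnz[:cut], B, C, num_rows)
    part_b = spmttkrp_mode0(nnz[cut:], B, C, num_rows)
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(part_a, part_b)]


# Toy 2x2x2 tensor with three nonzeros and rank-2 factor matrices.
nnz = [(0, 0, 1, 2.0), (0, 1, 0, 1.0), (1, 1, 1, 3.0)]
B = [[1.0, 2.0], [3.0, 4.0]]
C = [[5.0, 6.0], [7.0, 8.0]]

full = spmttkrp_mode0(nnz, B, C, num_rows=2)
merged = split_and_merge(nnz, B, C, num_rows=2, pim_share=0.5)
print(full)            # [[29.0, 56.0], [63.0, 96.0]]
print(merged == full)  # True
```

Each nonzero triggers only a handful of multiply-adds but two scattered row reads, so the arithmetic units spend most of their time waiting on memory. A real split would also have to balance load and handle output rows shared between partitions, which this sketch ignores.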
On the UPMEM platform, PRISM delivers up to 2.37x speedup over state-of-the-art CPU implementations when running on PIM alone, and up to 2.64x with heterogeneous CPU+PIM execution. The team also reports that resource efficiency, measured as the fraction of peak performance actually used, is significantly higher than for both CPU and GPU alternatives. However, the UPMEM distributed memory system can degrade performance on certain workloads because of data movement overheads. Accepted at IISWC '25, PRISM opens a promising direction for accelerating tensor algebra in memory-bound AI workloads.
- First PIM-based approach for the memory-bound spMTTKRP operation in tensor decomposition
- Up to 2.37x speedup with PIM-only and 2.64x with heterogeneous CPU+PIM vs. state-of-the-art CPU
- Achieves higher resource efficiency (peak performance fraction) than both CPU and GPU implementations
Why It Matters
Tensor decomposition is critical for large-scale ML. PRISM shows that PIM can speed up its memory-bound core operation by up to 2.64x over state-of-the-art CPU implementations, pointing toward faster tensor algebra in AI workloads.