Research & Papers

LP-GEMM: Integrating Layout Propagation into GEMM Operations

New GEMM kernel eliminates redundant data packing, achieving 2.25x average speedups over OpenBLAS on sequences of dependent matrix operations.

Deep Dive

A research team from the University of Campinas has published LP-GEMM, a novel approach to optimizing sequences of dependent General Matrix Multiplications (GEMMs), the operations that dominate execution time in scientific computing and modern machine learning workloads. The core innovation addresses a fundamental limitation of state-of-the-art BLAS libraries such as OpenBLAS and Intel MKL: each GEMM call must independently pack its input matrices from their canonical row- or column-major layout into an internal blocked format and restore its output to canonical form, so a sequence of dependent calls performs this packing and unpacking redundantly at every step. LP-GEMM introduces a kernel decomposition that propagates these internal layouts across dependent operations, preserving full BLAS semantic correctness while eliminating the wasteful data movement.
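To make the redundancy concrete, below is a minimal sketch of two dependent GEMMs, C = A*B followed by E = C*D, through the standard CBLAS interface (assuming single precision and OpenBLAS; matrix sizes and values are illustrative):

    #include <cblas.h>   // e.g. OpenBLAS; link with -lopenblas
    #include <vector>

    int main() {
        const int n = 512;
        std::vector<float> A(n * n, 1.0f), B(n * n, 1.0f), D(n * n, 1.0f);
        std::vector<float> C(n * n, 0.0f), E(n * n, 0.0f);

        // C = 1.0 * A * B + 0.0 * C: the library packs A and B into its
        // internal layout before computing.
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0f, A.data(), n, B.data(), n,
                    0.0f, C.data(), n);

        // E = 1.0 * C * D + 0.0 * E: C is packed again from scratch, even
        // though the previous call just produced it; this is the redundant
        // data movement that layout propagation removes.
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0f, C.data(), n, D.data(), n,
                    0.0f, E.data(), n);
        return 0;
    }

Because the two calls are opaque to each other, the intermediate C crosses the API boundary in canonical layout even when no other code touches it in between.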

The researchers evaluated LP-GEMM on both x86 (with AVX-512) and RISC-V (with RVV 1.0) architectures, across MLP-like and Attention-like workloads common in AI models. Results showed an average speedup of 2.25x over OpenBLAS on Intel x86 for sequential GEMMs, along with performance competitive with the highly optimized Intel MKL. To demonstrate practical utility beyond microbenchmarks, the team also implemented a standalone C++ version of the entire Llama-3.2 inference path using exclusively BLAS-level GEMM calls, confirming the performance benefits in a real-world scenario.
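For a sense of what an "Attention-like" chain of dependent GEMMs looks like at the BLAS level, here is a rough sketch (not the paper's code; softmax and scaling are omitted and the dimensions are arbitrary) computing S = Q*K^T followed by O = S*V:

    #include <cblas.h>
    #include <vector>

    int main() {
        const int seq = 128, dim = 64;
        std::vector<float> Q(seq * dim, 0.01f), K(seq * dim, 0.01f),
                           V(seq * dim, 0.01f);
        std::vector<float> S(seq * seq, 0.0f), O(seq * dim, 0.0f);

        // S = Q * K^T: (seq x dim) * (dim x seq) -> (seq x seq)
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
                    seq, seq, dim, 1.0f, Q.data(), dim, K.data(), dim,
                    0.0f, S.data(), seq);

        // O = S * V: (seq x seq) * (seq x dim) -> (seq x dim); a
        // conventional BLAS re-packs S here, the step a
        // layout-propagating kernel can skip.
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    seq, dim, seq, 1.0f, S.data(), seq, V.data(), dim,
                    0.0f, O.data(), dim);
        return 0;
    }

Every attention block and MLP layer in a model like Llama-3.2 repeats chains of this shape, which is why savings on intermediate packing compound over a full inference pass.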

This work highlights that significant performance gains are still achievable at the fundamental linear algebra layer by optimizing the dataflow between operations rather than each operation in isolation. The approach is architecture-agnostic, showing benefits on both traditional x86 and emerging RISC-V platforms, and it integrates with existing BLAS interfaces without requiring changes to high-level application code.

Key Points
  • Achieves 2.25x average speedup over OpenBLAS on Intel x86 for sequential GEMM operations
  • Demonstrated with a full Llama-3.2 inference path implementation using only BLAS calls
  • Eliminates redundant data packing/unpacking by propagating memory layouts across dependent operations (see the sketch after this list)
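The layout-propagation idea itself can be modeled with a deliberately simplified sketch. Everything below is hypothetical: packing is a plain copy, the packed GEMM is a naive triple loop, and the names (Packed, pack, unpack, gemm_packed) are illustrative stand-ins rather than LP-GEMM's actual interface. The point is only the dataflow: the intermediate result stays in the internal layout between dependent multiplications instead of being unpacked and re-packed.

    #include <vector>

    // Stand-in for a blocked, cache-friendly internal layout.
    struct Packed {
        int rows, cols;
        std::vector<float> buf;
    };

    // Real packing would reorder data into tiles; modeled here as a copy.
    Packed pack(const std::vector<float>& m, int rows, int cols) {
        return {rows, cols, m};
    }

    // Real unpacking would restore row-major order from the tiled form.
    std::vector<float> unpack(const Packed& p) {
        return p.buf;
    }

    // C = A * B computed entirely on packed operands, producing a packed
    // result that the next GEMM can consume without re-packing.
    Packed gemm_packed(const Packed& a, const Packed& b) {
        Packed c{a.rows, b.cols, std::vector<float>(a.rows * b.cols, 0.0f)};
        for (int i = 0; i < a.rows; ++i)
            for (int k = 0; k < a.cols; ++k)
                for (int j = 0; j < b.cols; ++j)
                    c.buf[i * b.cols + j] +=
                        a.buf[i * a.cols + k] * b.buf[k * b.cols + j];
        return c;
    }

    int main() {
        const int n = 64;
        std::vector<float> A(n * n, 1.0f), B(n * n, 1.0f), D(n * n, 1.0f);

        // E = (A * B) * D with the intermediate kept in packed form: no
        // unpack after the first GEMM, no re-pack before the second.
        Packed Cp = gemm_packed(pack(A, n, n), pack(B, n, n));
        std::vector<float> E = unpack(gemm_packed(Cp, pack(D, n, n)));
        (void)E;
        return 0;
    }

The first GEMM of a chain still packs its inputs and the last still unpacks its output, so full BLAS semantics are preserved at the boundaries while the interior transitions become free.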

Why It Matters

Directly accelerates core AI and HPC workloads at the library level, potentially reducing inference latency and computational costs across the industry.