Research & Papers

FalconGEMM: Surpassing Hardware Peaks with Lower-Complexity Matrix Multiplication

A new framework makes theoretically lower-complexity matrix multiplication practical, delivering effective throughput beyond hardware peak performance.

Deep Dive

A team of researchers from multiple institutions has released FalconGEMM, a cross-platform framework that automates the deployment, optimization, and selection of Lower-Complexity Matrix Multiplication Algorithms (LCMAs). Because these algorithms perform fewer arithmetic operations than classical matrix multiplication, they can deliver effective throughput above a chip's nominal peak, but they have historically been difficult to deploy in production. FalconGEMM addresses this with three key components: a Deployment Module that generates portable code for different hardware and input configurations; an Execution Module that applies group-parallel optimizations to maximize on-chip data reuse and reduce bandwidth overhead; and a Decision Module whose lightweight analytical performance model selects the optimal strategy for a given matrix shape and hardware profile.
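To make the LCMA idea concrete: the classic example is Strassen's algorithm, which multiplies two matrices using 7 block products where classical blocking needs 8, cutting the asymptotic complexity below O(n³). The sketch below (not FalconGEMM's code; a standard one-level Strassen step in NumPy) shows where the saved multiplication comes from:

```python
import numpy as np

def strassen_one_level(A, B):
    """One level of Strassen's algorithm for even-sized square matrices:
    7 block multiplications instead of the classical 8, at the cost of
    extra block additions."""
    h = A.shape[0] // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]

    # The 7 Strassen products
    M1 = (A11 + A22) @ (B11 + B22)
    M2 = (A21 + A22) @ B11
    M3 = A11 @ (B12 - B22)
    M4 = A22 @ (B21 - B11)
    M5 = (A11 + A12) @ B22
    M6 = (A21 - A11) @ (B11 + B12)
    M7 = (A12 - A22) @ (B21 + B22)

    # Recombine into the four output blocks
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C
```

The extra additions are exactly the bandwidth overhead that FalconGEMM's Execution Module targets with on-chip data reuse: the multiplications get cheaper, but naive implementations pay for it in memory traffic.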

Extensive evaluation on LLM workloads across NVIDIA H20 and A100 GPUs, as well as ARM and x86 CPUs, shows FalconGEMM outperforming industry-standard libraries like cuBLAS, CUTLASS, and Intel MKL by 7.59% to 17.85%. It also outperforms the LCMA competitor AlphaTensor by 12.41% to 55.61%. The framework makes the theoretical promise of lower-complexity matrix multiplication practical for production deployment across the heterogeneous landscape of modern hardware, potentially accelerating LLM training and inference without requiring specialized hardware.

Key Points
  • Outperforms cuBLAS/CUTLASS/MKL by 7.59%-17.85% on LLM workloads across GPU (H20, A100) and CPU (ARM, x86).
  • Beats AlphaTensor (a previous LCMA optimizer) by 12.41%-55.61% using a lightweight analytical performance model.
  • Three-module design: portable code generation, group-parallel execution with data reuse, and automated strategy selection.

Why It Matters

FalconGEMM makes faster, more energy-efficient matrix multiplication practical for LLM training and inference across heterogeneous GPUs and CPUs, without requiring specialized hardware.