Research & Papers

BLEST turbocharges BFS on GPUs with 22x speedup using Tensor Cores

New framework reformulates BFS as matrix ops on Tensor Cores, achieving 22x over GAP.

Deep Dive

Modern GPUs pack Tensor Cores (TCs) for high-throughput matrix operations, but graph algorithms like Breadth-First Search (BFS) struggle because of irregular, data-dependent access patterns. Now, Deniz Elbek and Kamer Kaya from Sabancı University have unveiled BLEST, a framework that reformulates BFS as a bit-level sparse matrix-vector computation tailored for TCs. BLEST introduces Binarized Virtual Slice Sets (BVSS), a graph representation that partitions work into balanced warp-level units and schedules only frontier-relevant regions, dramatically reducing memory inefficiency and synchronization overhead.

BLEST also employs an optimized TC layout that maps neighbor checks onto binary MMA (matrix multiply-accumulate) instructions without wasted outputs, slashing the number of required MMA calls by 8x compared to prior layouts. To mitigate atomic and cache bottlenecks, it incorporates a lazy vertex-update scheme, and it dynamically switches from TCs to CUDA cores when that becomes more efficient. The framework extends to multi-source BFS and closeness centrality workloads, and introduces a scalable graph reordering method to improve compression for scale-free graphs. Across a broad set of real-world graphs, BLEST achieves average speedups of 22.0x over GAP, 7.7x over Gunrock, 8.1x over GSWITCH, and 5.9x over BerryBees, establishing a new BFS baseline on GPUs. In a landmark demonstration, BLEST computed exact closeness centralities for 65.6 million vertices in a social network with 3.6 billion edges in under an hour using 100 H100 GPUs.

Key Points
  • BLEST reformulates BFS as bit-level sparse matrix-vector ops on Tensor Cores, cutting MMA calls by 8x vs. prior layouts.
  • Achieves 22x speedup over GAP and 7.7x over Gunrock across diverse real-world graphs.
  • Processed 65.6M vertices and 3.6B edges for exact closeness centrality in one hour on 100 H100 GPUs.

Why It Matters

Makes graph analytics on GPUs dramatically faster, enabling real-time analysis of massive social and infrastructure networks.