Research & Papers

[D] 60% MatMul Performance Bug in cuBLAS on RTX 5090

A critical bug in NVIDIA's cuBLAS library leaves consumer RTX GPUs using only about 40% of their available compute for common batched matrix multiplication workloads.

Deep Dive

A detailed technical analysis has uncovered a major performance regression in NVIDIA's cuBLAS (CUDA Basic Linear Algebra Subroutines) library, a critical component for AI and scientific computing. The bug specifically affects batched FP32 matrix multiplication workloads on consumer RTX GPUs like the new RTX 5090, where cuBLAS dispatches an inefficient kernel that uses only about 40% of the available computational resources. Testing with CUDA 13.2.51, cuBLAS 13.3.0, and driver 595.58.03 shows the issue across matrix sizes from 256×256 to 8192×8192 and batch sizes up to 16; earlier software versions performed even worse.
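
The write-up does not include a reproduction harness, but the affected code path is easy to time yourself. Below is a minimal sketch assuming square matrices and the cublasSgemmStridedBatched entry point; the size and batch count are examples drawn from the ranges tested, not the author's exact configuration.

    // Hypothetical timing sketch for the affected path: one strided-batched
    // FP32 GEMM through cuBLAS. Compile with: nvcc repro.cu -lcublas
    #include <cublas_v2.h>
    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        const int n = 4096;        // analysis covered 256x256 up to 8192x8192
        const int batch = 16;      // batch sizes up to 16 were tested
        const long long stride = (long long)n * n;

        float *A, *B, *C;
        cudaMalloc(&A, sizeof(float) * stride * batch);
        cudaMalloc(&B, sizeof(float) * stride * batch);
        cudaMalloc(&C, sizeof(float) * stride * batch);

        cublasHandle_t handle;
        cublasCreate(&handle);
        const float alpha = 1.0f, beta = 0.0f;

        // Warm-up call so one-time setup and kernel selection are excluded.
        cublasSgemmStridedBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                                  &alpha, A, n, stride, B, n, stride,
                                  &beta, C, n, stride, batch);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start);
        cublasSgemmStridedBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                                  &alpha, A, n, stride, B, n, stride,
                                  &beta, C, n, stride, batch);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        // 2*n^3 FLOPs per matrix in the batch (each FMA counts as two ops).
        double tflops = 2.0 * n * n * (double)n * batch / (ms * 1e-3) / 1e12;
        printf("%dx%d, batch %d: %.2f ms, %.1f TFLOP/s\n",
               n, n, batch, ms, tflops);

        cublasDestroy(handle);
        cudaFree(A); cudaFree(B); cudaFree(C);
        return 0;
    }

Comparing the printed TFLOP/s figure against the card's FP32 peak yields the kind of utilization number the analysis reports.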

The researcher demonstrated the severity by writing a simple yet efficient custom kernel that outperforms cuBLAS by 46-65% in batched mode on the RTX 5090. The custom implementation, built on a Tensor Memory Accelerator (TMA) double-buffering technique, achieves 80-120% of the performance of the properly optimized kernels found on professional GPUs like the H200. This reveals a clear disparity in software optimization between NVIDIA's consumer and professional lines: the H200 implementation efficiently mixes kernels from the CUTLASS and xmma families to reach 82% FMA (fused multiply-add) utilization, while the RTX 5090 suffers from poor kernel selection.
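
The analysis does not reproduce the kernel source, but the double-buffering idea behind it can be illustrated without TMA. The sketch below uses two shared-memory buffers in a ping-pong arrangement, staging the next tile pair into the idle buffer each iteration while the arithmetic loop consumes the current one; the actual kernel replaces these plain loads with TMA bulk copies and is tuned well beyond this. Tile size and names are illustrative only.

    // Illustrative double-buffered tiled SGEMM (C = A * B, row-major,
    // N assumed to be a multiple of TILE). This is not the researcher's
    // TMA kernel; it only demonstrates the ping-pong buffer structure.
    // Launch with dim3(N/TILE, N/TILE) blocks of dim3(TILE, TILE) threads.
    #define TILE 32

    __global__ void sgemm_double_buffered(const float* A, const float* B,
                                          float* C, int N) {
        __shared__ float As[2][TILE][TILE];
        __shared__ float Bs[2][TILE][TILE];

        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.0f;

        // Preload the first tile pair into buffer 0.
        As[0][threadIdx.y][threadIdx.x] = A[row * N + threadIdx.x];
        Bs[0][threadIdx.y][threadIdx.x] = B[threadIdx.y * N + col];
        __syncthreads();

        int numTiles = N / TILE;
        for (int t = 0; t < numTiles; ++t) {
            int cur = t & 1, nxt = cur ^ 1;

            // Stage tile t+1 into the idle buffer; the barrier at the end
            // of the previous iteration guarantees all reads of it are done.
            if (t + 1 < numTiles) {
                int k0 = (t + 1) * TILE;
                As[nxt][threadIdx.y][threadIdx.x] = A[row * N + k0 + threadIdx.x];
                Bs[nxt][threadIdx.y][threadIdx.x] = B[(k0 + threadIdx.y) * N + col];
            }

            // Consume the current buffer.
            #pragma unroll
            for (int k = 0; k < TILE; ++k)
                acc += As[cur][threadIdx.y][k] * Bs[cur][k][threadIdx.x];

            __syncthreads();  // buffers swap roles on the next iteration
        }
        C[row * N + col] = acc;
    }

The payoff is a single __syncthreads() per tile instead of two, letting memory traffic for the next tile overlap with compute on the current one; TMA takes the same pattern further by handing the copies to dedicated hardware.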

The findings have significant implications for developers working with AI frameworks, scientific simulations, and any application relying on batched linear algebra operations. The performance gap means researchers and engineers using consumer RTX cards for prototyping or production computation are leaving substantial throughput on the table. While NVIDIA typically prioritizes optimization for its data center and professional GPUs, this bug highlights how consumer-grade hardware can suffer from suboptimal software support, potentially slowing down development cycles and increasing compute costs for teams without access to high-end professional cards.

Key Points
  • cuBLAS bug reduces RTX 5090 batched FP32 MatMul performance to ~40% of available compute across common matrix sizes (see the utilization arithmetic sketched after this list)
  • Custom TMA double-buffer kernel beats cuBLAS by 46-65% and reaches 80-120% of the throughput of NVIDIA's properly optimized professional-GPU kernels
  • NVIDIA's H200 uses optimized CUTLASS/xmma mix reaching 82% FMA utilization, showing clear software optimization disparity between consumer and pro lines
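
As referenced in the first key point, the utilization arithmetic is simple: a batched m×n×k multiply performs 2·m·n·k·batch floating-point operations (each FMA counts as two), and utilization is achieved throughput divided by the GPU's peak. A worked sketch follows; the runtime and peak numbers are deliberately hypothetical, chosen only so the output lands near the reported ~40%. Substitute a measured time and your card's actual FP32 peak.

    #include <cstdio>

    int main() {
        double m = 4096, n = 4096, k = 4096, batch = 16;
        double elapsed_ms = 55.0;     // hypothetical measured runtime
        double peak_tflops = 100.0;   // hypothetical FP32 FMA peak

        double flops = 2.0 * m * n * k * batch;                // FMA = 2 FLOPs
        double achieved = flops / (elapsed_ms * 1e-3) / 1e12;  // TFLOP/s
        printf("achieved %.1f TFLOP/s -> %.0f%% of peak\n",
               achieved, 100.0 * achieved / peak_tflops);
        return 0;
    }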

Why It Matters

AI researchers and developers using consumer RTX GPUs for training and inference are losing substantial performance because NVIDIA's software dispatches suboptimal kernels on those cards.