Research & Papers

BF16 Tensor Cores beat FP32 SGEMM on Blackwell GPUs for science

New research shows that emulating FP32 matrix multiplication on BF16 Tensor Cores surpasses native FP32 in both accuracy and speed.

Deep Dive

A team of researchers from NVIDIA, IBM, and other institutions has demonstrated that BF16 Tensor Cores on Blackwell GPUs can emulate FP32 matrix multiplication (SGEMM) with numerical and performance characteristics superior to native FP32 hardware. The key insight is that BF16 and FP32 share the same dynamic range (both use an 8-bit exponent), allowing BF16 products to be accumulated into FP32 registers at full speed using Blackwell's integrated scaling hardware. This approach exploits the abundance of reduced-precision Tensor Cores, whose throughput has grown faster than that of high-precision units because of AI demand, to deliver higher throughput and better accuracy for scientific computing workloads.
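To make the shared-dynamic-range idea concrete, here is a minimal NumPy sketch of one generic way to emulate an FP32 product with BF16-sized pieces. It is an illustration under stated assumptions, not the paper's algorithm: the three-way split, the truncating `to_bf16` helper, and the names `split3` and `emulated_sgemm` are all hypothetical, and BF16 is simulated by clearing the low 16 bits of a float32 rather than by using real Tensor Cores. The sketch also ignores denormals.

```python
import numpy as np

def to_bf16(x):
    """Truncate float32 values to BF16 precision (result stays in float32 storage)."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

def split3(x):
    """Split float32 x into three BF16-representable pieces with hi + mid + lo == x.

    Holds for normal-range inputs; denormals are not handled in this sketch.
    """
    x = np.asarray(x, dtype=np.float32)
    hi = to_bf16(x)
    r1 = x - hi        # exact: at most 16 significant bits remain
    mid = to_bf16(r1)
    lo = r1 - mid      # exact: at most 8 significant bits, so already BF16-sized
    return hi, mid, lo

def emulated_sgemm(a, b):
    """Sum the nine piecewise products, accumulating in float32."""
    a_parts = split3(a)
    b_parts = split3(b)
    c = np.zeros((a.shape[0], b.shape[1]), dtype=np.float32)
    # Add the smallest-magnitude terms first (lo*lo), the largest last (hi*hi).
    for s in range(4, -1, -1):
        for i in range(3):
            j = s - i
            if 0 <= j < 3:
                c += a_parts[i] @ b_parts[j]
    return c

rng = np.random.default_rng(0)
a = rng.standard_normal((32, 64)).astype(np.float32)
b = rng.standard_normal((64, 16)).astype(np.float32)

c_emul = emulated_sgemm(a, b)
c_ref = a.astype(np.float64) @ b.astype(np.float64)  # float64 reference
max_err = np.max(np.abs(c_emul - c_ref))
```

Because each piece carries at most 8 significant bits, every elementwise product of two pieces fits exactly in an FP32 significand, so rounding only enters during accumulation. Since all pieces share FP32's exponent range, no per-piece rescaling is needed for normal-range inputs.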

The team implemented a full library-level solution that correctly handles denormals, a critical detail for many scientific applications. Their benchmarks show that BF16-based emulation exceeds native FP32 in both precision and performance, overturning the assumption that reduced-precision emulation must sacrifice quality. This work opens the door for scientists to run FP32 workloads on AI-optimized hardware, potentially accelerating simulations, climate modeling, and other compute-intensive fields without giving up FP32-level accuracy.
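A small NumPy experiment suggests why denormal handling deserves attention in any scheme that breaks FP32 values into BF16-sized pieces. This is a sketch under assumptions, not the library's method: BF16 is simulated by truncating a float32, the `to_bf16` helper is hypothetical, and the power-of-two pre-scaling at the end is just one illustrative remedy. It also assumes the host CPU is not flushing denormals to zero.

```python
import numpy as np

def to_bf16(x):
    """Truncate float32 values to BF16 precision (result stays in float32 storage)."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

# Smallest normal FP32 value plus one ulp: 2**-126 * (1 + 2**-23).
x = np.array([0x00800001], dtype=np.uint32).view(np.float32)

hi = to_bf16(x)            # 2**-126, still representable in BF16
residual = x - hi          # 2**-149, the smallest FP32 denormal
mid = to_bf16(residual)    # BF16 denormals stop at 2**-133, so this becomes 0

# Illustrative remedy: scale by a power of two first (exact), then split.
scale = np.float32(2.0 ** 32)
x_scaled = x * scale
residual_scaled = x_scaled - to_bf16(x_scaled)   # 2**-117, a normal value
mid_scaled = to_bf16(residual_scaled)            # survives the BF16 conversion
```

In the unscaled case the low-order part of the input is silently lost, while the scaled case keeps it; undoing the scale on the final result is exact as long as the result stays in range.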

Key Points
  • BF16 Tensor Cores on Blackwell GPUs achieve better numerical accuracy and performance than native FP32 SGEMM.
  • Shared dynamic range between BF16 and FP32 enables full-speed accumulation in FP32 registers via Blackwell's integrated scaling.
  • Library-ready implementation correctly handles denormals, making it practical for real scientific applications.

Why It Matters

Scientific applications can run FP32 matrix multiplication on AI-oriented Tensor Cores and get faster results with accuracy that matches or exceeds native FP32.