Double-Precision Matrix Multiplication Emulation via Ozaki-II Scheme with FP8 Quantization

arXiv cs.DC, March 12, 2026

⚡ New technique cuts the number of FP8 matrix multiplications needed to emulate FP64, enabling high-precision AI and HPC on next-gen GPUs like Blackwell Ultra.

Deep Dive

A team of researchers from Japan, including Katsuhisa Ozaki (namesake of the method), has published a paper aimed at high-performance computing (HPC) and AI. It proposes a technique to emulate double-precision (FP64) matrix multiplication, a core operation for scientific simulations and AI model training, using the much faster FP8 arithmetic units found in next-generation GPUs like NVIDIA's Blackwell Ultra and Rubin architectures. This matters because FP64 remains essential for numerical accuracy, yet recent hardware advances have focused on boosting low-precision formats like FP8, leaving FP64 performance gains modest.

The key innovation is adapting the established Ozaki-II emulation scheme to run on FP8 hardware, which the original algorithm could not do. Prior methods, like the Ozaki-I scheme, could use FP8 but were less efficient. The new approach significantly reduces the total number of FP8 matrix multiplications required to achieve an FP64-equivalent result. Fewer low-precision multiplications per emulated product means that workloads demanding high precision, from complex simulations to large language model (LLM) training, can run more efficiently on hardware designed primarily for AI, narrowing the gap between speed and accuracy.
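To give a feel for how a high-precision product can be assembled from several low-precision ones, here is a minimal sketch of residue-based matrix multiplication using the Chinese Remainder Theorem, the kind of integer modular technique we understand Ozaki-II-style emulation to build on. It is an illustration under stated assumptions, not the paper's algorithm: the moduli, the function names, and the use of plain Python integers are all hypothetical, and the steps that matter in practice (scaling FP64 inputs to integers, quantizing to FP8, running on GPU matrix units) are left out.

```python
from math import prod

# Hypothetical pairwise-coprime moduli (all prime), each below 256 so that
# every residue fits in 8 bits. These values are illustrative assumptions.
MODULI = [251, 241, 239, 233, 229, 227]


def matmul_exact(A, B):
    """Reference product using Python's arbitrary-precision integers."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def matmul_mod(A, B, p):
    """Product of A and B modulo p, with inputs first reduced to [0, p)."""
    Ap = [[a % p for a in row] for row in A]
    Bp = [[b % p for b in row] for row in B]
    return [[sum(a * b for a, b in zip(row, col)) % p for col in zip(*Bp)]
            for row in Ap]


def crt_combine(residues, moduli):
    """Return the unique x with |x| <= (M - 1) / 2 and x = r_i (mod p_i)."""
    M = prod(moduli)
    x = 0
    for r, p in zip(residues, moduli):
        Mi = M // p
        x += r * Mi * pow(Mi, -1, p)  # modular inverse (Python 3.8+)
    x %= M
    # Map from [0, M) to the symmetric range so negative entries survive.
    return x - M if x > M // 2 else x


def matmul_via_residues(A, B, moduli=MODULI):
    """Recover A @ B from one small modular product per modulus."""
    parts = [matmul_mod(A, B, p) for p in moduli]
    rows, cols = len(A), len(B[0])
    return [[crt_combine([part[i][j] for part in parts], moduli)
             for j in range(cols)] for i in range(rows)]


A = [[123456, -7890], [42, 31415]]
B = [[-271828, 1618], [9999, -123456]]
C = matmul_via_residues(A, B)
print(C == matmul_exact(A, B))  # True: six modular products recover the exact result
```

The reconstruction is exact only while every entry of the true product stays within half the product of the moduli, so the number of moduli (and therefore the number of low-precision multiplications) grows with the precision required. That count is the cost the article says the new approach reduces.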

Key Points

Emulates FP64-accurate matrix multiplication on the FP8 units of GPUs like NVIDIA Blackwell Ultra, where INT8 performance is reduced.
Novel adaptation of the Ozaki-II scheme requires fewer FP8 matrix multiplications than the older Ozaki-I method.
Addresses a critical hardware trend: future performance gains in HPC and AI depend on leveraging high-throughput low-precision arithmetic like FP8.

Why It Matters

Allows scientists and AI engineers to run high-precision calculations efficiently on the latest AI-optimized hardware, accelerating research and model development.
