Developer Tools

Faster Diffusion on Blackwell: MXFP8 and NVFP4 with Diffusers and TorchAO

New 4-bit and 8-bit formats on Blackwell GPUs slash memory use and boost diffusion model speed.

Deep Dive

NVIDIA has detailed significant performance gains for AI image and video generation by leveraging new low-precision number formats native to its Blackwell architecture GPUs, such as the B200. In collaboration with the PyTorch ecosystem (TorchAO) and Hugging Face's Diffusers library, the team applied MXFP8 and NVFP4 quantization to popular diffusion models including FLUX.1-dev, QwenImage, and LTX-2. These microscaling formats group values into small blocks that share a scale factor, allowing for aggressive bit-depth reduction—down to just 4 bits—while preserving model accuracy, as measured by metrics like LPIPS. The result is a reproducible speedup of up to 1.26x with 8-bit MXFP8 and a substantial 1.68x with 4-bit NVFP4, alongside a memory footprint approximately 3.5x smaller than standard BF16 precision.
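The block-wise sharing of a scale factor is the core idea behind microscaling. A minimal, dependency-free sketch (not the hardware format itself—real MXFP8/NVFP4 use FP8/FP4 element encodings and hardware-defined block sizes, and the function names here are illustrative) shows why a shared per-block scale lets individual elements survive in very few bits:

```python
# Toy sketch of block-wise ("microscaling") quantization with symmetric
# integer codes. Each block of values shares one scale factor, so each
# element needs only a few bits. Real MX formats use FP8/FP4 element
# types and power-of-two shared scales; this is only the shape of the idea.

def quantize_block(block, n_bits=4):
    """Quantize one block to signed n-bit integers with a shared scale."""
    qmax = 2 ** (n_bits - 1) - 1                 # e.g. 7 for 4-bit signed
    scale = max(abs(v) for v in block) / qmax or 1.0  # avoid scale == 0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in block]
    return q, scale

def dequantize_block(q, scale):
    """Reconstruct approximate values from codes and the shared scale."""
    return [v * scale for v in q]

def quantize(values, block_size=32, n_bits=4):
    """Split a flat tensor into blocks, each with its own shared scale."""
    blocks = [values[i:i + block_size]
              for i in range(0, len(values), block_size)]
    return [quantize_block(b, n_bits) for b in blocks]
```

Because the scale is chosen per small block rather than per tensor, one outlier only degrades its own block's resolution—this is what lets 4-bit codes track accuracy closely enough that metrics like LPIPS stay stable.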

For developers, integrating this optimization is streamlined through TorchAO's native integration with the Diffusers library. A simple quantization configuration allows models to be loaded and compiled for the new formats, with techniques like CUDA Graphs and selective quantization applied to maximize throughput. The NVFP4 format, in particular, is designed for high-batch, compute-bound workloads, unlocking new levels of efficiency. This advancement directly addresses the major bottlenecks of memory and compute that have constrained the deployment of state-of-the-art generative visual models, making them more accessible for real-time and large-scale serving scenarios.
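The loading-plus-compilation flow described above can be sketched with Diffusers' `TorchAoConfig` hook. This is a hedged illustration, not the post's exact recipe: it uses the widely available `int8wo` weight-only type as a stand-in, since the Blackwell-specific MXFP8/NVFP4 configs are passed through the same `quantization_config` mechanism, and it requires a CUDA GPU plus the FLUX.1-dev weights to actually run:

```python
# Sketch of selective quantization + compilation with Diffusers and TorchAO.
# "int8wo" is a stand-in quant type; Blackwell MXFP8/NVFP4 configs plug into
# the same quantization_config hook. Requires a CUDA GPU and model weights.
import torch
from diffusers import FluxPipeline, FluxTransformer2DModel, TorchAoConfig

model_id = "black-forest-labs/FLUX.1-dev"

# Selective quantization: quantize only the compute-heavy transformer.
transformer = FluxTransformer2DModel.from_pretrained(
    model_id,
    subfolder="transformer",
    quantization_config=TorchAoConfig("int8wo"),
    torch_dtype=torch.bfloat16,
)

pipe = FluxPipeline.from_pretrained(
    model_id, transformer=transformer, torch_dtype=torch.bfloat16
).to("cuda")

# torch.compile in max-autotune mode enables CUDA Graphs to cut launch
# overhead and maximize throughput on repeated denoising steps.
pipe.transformer = torch.compile(
    pipe.transformer, mode="max-autotune", fullgraph=True
)

image = pipe(
    "a photo of an astronaut riding a horse", num_inference_steps=28
).images[0]
```

Quantizing only the transformer keeps the text encoders and VAE at full precision, which is where the memory savings matter least and quality sensitivity is highest.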

Key Points
  • Achieves up to 1.68x faster inference using 4-bit NVFP4 quantization on Blackwell B200 GPUs.
  • Reduces model memory footprint by ~3.5x compared to standard BF16, enabling larger models or batch sizes.
  • Integrated into Hugging Face Diffusers via TorchAO for easy adoption by developers using models like FLUX.1-dev.

Why It Matters

Dramatically lowers the cost and latency of running cutting-edge AI image generation, enabling more practical real-time applications.