DFlash: Block Diffusion for Flash Speculative Decoding
A new speculative decoding method accelerates Stable Diffusion 1.5 by 2.5x with minimal quality loss.
Z-Lab, a research group, has open-sourced DFlash, short for "Block Diffusion for Flash Speculative Decoding." The method applies speculative decoding—a technique popular for accelerating large language models (LLMs) such as Llama and GPT—to the harder domain of image generation. In traditional speculative decoding, a small, fast draft model predicts tokens that a larger target model then verifies in a single cheap pass. DFlash adapts this idea by having the draft model predict entire blocks of a latent image representation, which the target diffusion model (e.g., Stable Diffusion 1.5) then accepts or rejects. Operating at the block level is what makes speculative decoding practical for the iterative denoising loop of diffusion models, where per-token drafting would not map onto the sampling process.
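The draft-then-verify loop described above can be illustrated with a toy sketch. Everything here is hypothetical: `draft_denoise`, `target_denoise`, the block layout, and the acceptance tolerance are stand-ins for illustration, not the actual DFlash implementation. In a real pipeline the target verifies all drafted blocks in one batched forward pass, which is where the speedup comes from; this sketch only shows the accept/reject logic.

```python
import numpy as np

def draft_denoise(block, step):
    # Hypothetical cheap draft predictor: a coarse denoising estimate.
    return block * 0.9

def target_denoise(block, step):
    # Hypothetical expensive target predictor (the full diffusion model).
    # Here it differs from the draft by a small residual.
    return block * 0.9 + 0.001

def speculative_step(latent, block_size, step, tol=0.01):
    """One denoising step over a 1-D latent: the draft proposes each
    block; the target verifies, and blocks whose draft prediction
    deviates beyond `tol` fall back to the target's own output."""
    out = np.empty_like(latent)
    accepted = 0
    for start in range(0, latent.shape[0], block_size):
        block = latent[start:start + block_size]
        proposal = draft_denoise(block, step)
        reference = target_denoise(block, step)
        if np.max(np.abs(proposal - reference)) <= tol:
            out[start:start + block_size] = proposal  # draft accepted
            accepted += 1
        else:
            out[start:start + block_size] = reference  # draft rejected
    return out, accepted

latent = np.ones(16, dtype=np.float32)
denoised, n_accepted = speculative_step(latent, block_size=4, step=0)
```

In this toy setup the draft stays within tolerance of the target, so all four blocks are accepted; tightening `tol` below the residual would force every block back to the target model, recovering baseline quality at baseline cost.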
In benchmarks, DFlash delivers a substantial performance gain, reducing generation latency for Stable Diffusion 1.5 by 2.5x. Crucially, the speedup comes with negligible impact on output quality, as measured by standard evaluation metrics such as Fréchet Inception Distance (FID) and CLIP score. The project is fully open-source, with code on GitHub and model weights on Hugging Face, so developers and researchers can integrate it into their own pipelines. This is a notable step toward making high-quality AI image generation more accessible and responsive without requiring more expensive hardware.
- Applies speculative decoding to diffusion models for a 2.5x speedup in Stable Diffusion 1.5.
- Uses a draft model to predict image blocks for verification by the target model, minimizing quality loss.
- Fully open-source implementation available on GitHub and Hugging Face for immediate integration.
Why It Matters
Enables faster, cheaper AI image generation, improving responsiveness for end users and reducing compute costs without a hardware upgrade.