Duality Models: An Embarrassingly Simple One-step Generation Paradigm
A new AI architecture from Peng Sun et al. generates high-fidelity images in 2 steps, achieving a record 1.79 FID on ImageNet 256x256.
A team of researchers led by Peng Sun has introduced Duality Models (DuMo), a breakthrough architecture that rethinks the training paradigm for consistency-based generative models. Published on arXiv, the paper addresses a critical inefficiency in models such as Shortcut and MeanFlow, which traditionally partition their training budget, often dedicating 75% of samples to a multi-step objective for stability while leaving the crucial few-step generation objective undertrained. This separation creates a fundamental trade-off that harms convergence and limits scalability.
DuMo's innovation is an 'embarrassingly simple' shift to a 'one input, dual output' design. Instead of producing a single output, a shared backbone network with dual heads simultaneously predicts both the velocity ($v_t$) and the flow-map ($u_t$) from a single input ($x_t$). This architectural change applies the geometric constraints of the multi-step objective to every single training sample. It effectively bounds the estimation error of the few-step generation path without needing to separate the training objectives, dramatically improving training stability and sample efficiency.
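To make the design concrete, here is a minimal PyTorch-style sketch of the 'one input, dual output' idea: a shared backbone consumes $x_t$ together with the two time conditions, and two lightweight heads read off $v_t$ and $u_t$ from the same features. The module name, layer sizes, and conditioning scheme are illustrative assumptions, not the paper's 679M-parameter DiT.

```python
# A minimal sketch of the 'one input, dual output' design, assuming a
# PyTorch-style setup. DualHeadNet, the layer sizes, and the time conditioning
# are illustrative placeholders, not the paper's architecture.
import torch
import torch.nn as nn

class DualHeadNet(nn.Module):
    def __init__(self, dim=256, hidden=1024):
        super().__init__()
        # Shared backbone: stands in for the large transformer trunk.
        self.backbone = nn.Sequential(
            nn.Linear(dim + 2, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
        )
        # Two lightweight heads read off both predictions from the same features.
        self.velocity_head = nn.Linear(hidden, dim)  # v_t: instantaneous velocity
        self.flowmap_head = nn.Linear(hidden, dim)   # u_t: flow-map (average velocity)

    def forward(self, x_t, t, r):
        # One input x_t plus the current time t and the jump-target time r.
        h = self.backbone(torch.cat([x_t, t, r], dim=-1))
        return self.velocity_head(h), self.flowmap_head(h)

model = DualHeadNet()
x_t = torch.randn(8, 256)          # a batch of noisy latents
t = torch.rand(8, 1)               # current time
r = torch.rand(8, 1) * t           # earlier target time, r <= t
v_pred, u_pred = model(x_t, t, r)  # single forward pass, dual outputs
```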
The practical results are striking. When applied to a 679M-parameter Diffusion Transformer (DiT) with an SD-VAE decoder, DuMo achieves a state-of-the-art Fréchet Inception Distance (FID) of 1.79 on the challenging ImageNet 256x256 benchmark. Crucially, it reaches this level of fidelity in just 2 generation steps, a massive leap in inference speed over traditional diffusion models that require dozens or hundreds of steps. This work provides a more unified and efficient training framework that could accelerate the development of high-quality, real-time generative models for images, video, and audio.
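For intuition on how 2-step generation works with a flow-map head, the sketch below (reusing the `DualHeadNet` toy model above) assumes a MeanFlow-style convention in which the flow map predicts an average velocity that jumps a sample from time $t$ directly to time $r$ via $x_r = x_t - (t - r)\,u$. The step schedule and sign conventions are assumptions for illustration, not the paper's exact sampler.

```python
# Hedged sketch of 2-step sampling with the flow-map head, assuming pure noise
# at t = 1, clean data at t = 0, and the jump rule x_r = x_t - (t - r) * u.
import torch

@torch.no_grad()
def sample_two_step(model, shape):
    x = torch.randn(shape)                    # start from Gaussian noise at t = 1
    times = [1.0, 0.5, 0.0]                   # a simple 2-step schedule
    for t, r in zip(times[:-1], times[1:]):
        t_vec = torch.full((shape[0], 1), t)
        r_vec = torch.full((shape[0], 1), r)
        _, u = model(x, t_vec, r_vec)         # only the flow-map head is used here
        x = x - (t - r) * u                   # jump directly from time t to time r
    return x                                  # approximate clean sample at t = 0

samples = sample_two_step(model, (8, 256))
```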
- Novel 'one input, dual output' paradigm uses a shared backbone with two heads to predict velocity and flow-map simultaneously.
- Eliminates the 75% training budget partition required by prior models like MeanFlow, applying multi-step constraints to every sample (see the training sketch after this list).
- Achieves a record 1.79 FID on ImageNet 256x256 with a 679M DiT model, generating high-quality images in only 2 steps.
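As a rough illustration of the second bullet, the sketch below applies both losses to every sample in the batch: a flow-matching loss on the velocity head and a MeanFlow-style flow-map loss on the other head, built with forward-mode autodiff from the identity $u = v - (t - r)\,\mathrm{d}u/\mathrm{d}t$. The interpolation path, target construction, and equal loss weighting are assumptions for illustration, not the authors' exact recipe.

```python
# Hedged sketch of a per-sample joint objective: no 75/25 budget split,
# every sample supervises BOTH heads of the DualHeadNet toy model above.
import torch
import torch.nn.functional as F
from torch.func import jvp

def joint_loss(model, x0, x1):
    # x0: clean data, x1: Gaussian noise; linear path x_t = (1 - t) * x0 + t * x1.
    b = x0.shape[0]
    t = torch.rand(b, 1)
    r = torch.rand(b, 1) * t                       # jump-target time, r <= t
    x_t = (1 - t) * x0 + t * x1
    v_target = x1 - x0                             # conditional flow-matching velocity

    # Flow-map target from the MeanFlow-style identity u = v - (t - r) * du/dt,
    # where du/dt is the total derivative along the path (tangents: dx = v, dt = 1, dr = 0).
    _, du_dt = jvp(
        lambda x, tt, rr: model(x, tt, rr)[1],
        (x_t, t, r),
        (v_target, torch.ones_like(t), torch.zeros_like(r)),
    )
    u_target = (v_target - (t - r) * du_dt).detach()  # stop-gradient on the target

    # One forward pass, dual outputs; both objectives hit every sample.
    v_pred, u_pred = model(x_t, t, r)
    return F.mse_loss(v_pred, v_target) + F.mse_loss(u_pred, u_target)

# Toy usage with random stand-ins for data and noise.
loss = joint_loss(model, torch.randn(8, 256), torch.randn(8, 256))
loss.backward()
```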
Why It Matters
Enables near-instant, high-fidelity image generation, drastically reducing compute costs and latency for real-time AI applications.