Representation-Conditioned Diffusion Models Beat Real Data by 2 p.p. on ImageNet100
Synthetic images from DINOv2-conditioned diffusion outperform real training data in classification.
A new paper from Linköping University presents a significant advance in using generative AI to produce training data for computer vision. The authors propose representation-conditioned diffusion models, in which a latent diffusion model is guided by learned visual representations (from DINOv2, DINOv3, or CLIP) instead of simple class labels. This conditioning strategy improves sample quality and mode coverage, so the synthetic images better capture the diversity of real-world data. On the ImageNet100 benchmark, the approach gained +10.76 percentage points (p.p.) in top-1 accuracy over baseline class-conditioned generation. More impressively, when the synthetic dataset was scaled up, a classifier trained solely on generated images outperformed one trained on the original real ImageNet100 data by +2.0 p.p. top-1 accuracy.
The work also demonstrates practical use cases: using generated images for data augmentation outperforms classical augmentation techniques, and the conditioning embeddings can be used to filter low-quality generated samples, further boosting training value. The method essentially turns diffusion models into a guided data factory, where the representation space acts as a control lever. While the paper focuses on image classification, the implications extend to any visual learning task where data is scarce or expensive to collect. The authors suggest that representation-conditioned generation could complement or even replace real-world datasets in large-scale visual learning, potentially reducing the need for costly annotation and curation.
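The filtering idea can be illustrated with a small sketch. The summary does not say how the paper scores samples, so the criterion here is an assumption: each generated image is re-encoded with the same encoder that produced its conditioning embedding, and the sample is kept only if the two embeddings are similar enough. The function names, the cosine-similarity score, and the `min_similarity` threshold are all hypothetical, and the toy 2-D vectors stand in for real DINOv2 or CLIP embeddings.

```python
import numpy as np


def cosine_similarity_rows(a, b):
    """Row-wise cosine similarity between two (n, d) arrays."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    # Normalise each row; the clip guards against division by zero.
    a_unit = a / np.clip(np.linalg.norm(a, axis=1, keepdims=True), 1e-12, None)
    b_unit = b / np.clip(np.linalg.norm(b, axis=1, keepdims=True), 1e-12, None)
    return np.sum(a_unit * b_unit, axis=1)


def filter_generated_samples(cond_embeddings, gen_embeddings, min_similarity=0.5):
    """Return a boolean keep-mask and the per-sample similarity scores.

    cond_embeddings: embeddings the generator was conditioned on, shape (n, d).
    gen_embeddings:  embeddings of the generated images, shape (n, d),
                     assumed to come from the same encoder.
    min_similarity:  hypothetical cut-off, not a value from the paper.
    """
    sims = cosine_similarity_rows(cond_embeddings, gen_embeddings)
    return sims >= min_similarity, sims


# Toy stand-ins for real embeddings: sample 1 drifted far from its condition.
cond = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
gen = np.array([[0.9, 0.1], [1.0, 0.0], [2.0, 2.0]])

keep_mask, sims = filter_generated_samples(cond, gen, min_similarity=0.5)
print(keep_mask)  # which generated samples survive the filter
```

In practice the threshold would need tuning on held-out data, since a cut-off that is too strict would throw away the sample diversity that the conditioning is meant to provide.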
- Representation-conditioned diffusion (using DINOv2, DINOv3, or CLIP embeddings) yields +10.76 p.p. higher top-1 accuracy on ImageNet100 vs. class-conditioned generation.
- Scaling up the synthetic dataset lets a classifier trained solely on generated images beat real-data training on ImageNet100 by +2.0 p.p. top-1 accuracy.
- Generated images improve augmentation beyond classical methods, and conditioning embeddings enable effective sample filtering.
Why It Matters
This could reduce reliance on expensive real datasets by offering controllable, high-quality synthetic training data that, at least on ImageNet100, outperforms the real thing.