Do Generative Metrics Predict YOLO Performance? An Evaluation Across Models, Augmentation Ratios, and Dataset Complexity
Research shows standard synthetic image metrics like FID often fail to predict real-world object detection accuracy.
A new computer vision study challenges the reliability of standard generative metrics for evaluating synthetic training data. An international research team ran a controlled evaluation of whether metrics like Fréchet Inception Distance (FID) actually predict downstream object detection performance when synthetic images are used to augment training sets.
The study tested six generators (GAN-, diffusion-, and hybrid-based) across three distinct detection regimes: Traffic Signs (sparse), Cityscapes Pedestrian (dense, occlusion-heavy), and COCO PottedPlant (multi-instance, high-variability). The team evaluated YOLOv11 with augmentation ratios ranging from 10% to 150% of the real training set, using both from-scratch training and COCO-pretrained initialization. Synthetic augmentation delivered substantial gains in the challenging regimes (up to +7.6% relative mAP for Pedestrian detection and +30.6% for PottedPlant detection) but offered only marginal benefits for Traffic Signs and under pretrained fine-tuning.
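To make the ratio setup concrete, the sketch below shows one way such a mixed training list could be assembled. The directory layout, the `build_augmented_split` helper, and the sampling policy are illustrative assumptions, not the study's actual pipeline.

```python
"""Minimal sketch of ratio-based synthetic augmentation for YOLO training.

The file layout, function name, and sampling details are illustrative;
the study's exact procedure may differ.
"""
import random
from pathlib import Path

def build_augmented_split(real_dir: Path, synth_dir: Path,
                          ratio: float, seed: int = 0) -> list[Path]:
    """Return all real images plus a synthetic sample sized at ratio * len(real).

    ratio=0.1 adds 10% synthetic; ratio=1.5 adds 150%, so synthetic images
    can outnumber real ones, as in the study's largest setting.
    """
    real = sorted(real_dir.glob("*.jpg"))
    synth = sorted(synth_dir.glob("*.jpg"))
    n_synth = int(round(ratio * len(real)))
    rng = random.Random(seed)
    # Sample without replacement when possible; fall back to sampling with
    # replacement if the generator produced fewer images than requested.
    if n_synth <= len(synth):
        picked = rng.sample(synth, n_synth)
    else:
        picked = rng.choices(synth, k=n_synth)
    return real + picked

# Example: a 50% augmentation split for the (hypothetical) Pedestrian paths.
# train_files = build_augmented_split(Path("real/ped"), Path("synth/ped"), ratio=0.5)
```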
Crucially, the research implemented a matched-size bootstrap protocol to compute both global feature-space metrics (using Inception-v3 and DINOv2 embeddings) and object-centric distribution distances over bounding-box statistics. After controlling for augmentation quantity through residualized correlations, many apparent metric-performance associations weakened markedly. The study shows that metric-performance alignment is strongly regime-dependent, suggesting practitioners cannot rely on standard generative metrics alone when evaluating synthetic datasets for object detection tasks.
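Matched-size resampling matters because FID-style distances are biased by sample size: comparing a 500-image synthetic set against 5,000 real images is not on the same scale as a 5,000-vs-5,000 computation. The following is a minimal sketch of the idea, assuming features have already been extracted (e.g., Inception-v3 or DINOv2 embeddings as row vectors); the resample count, subsample size, and replacement policy here are assumptions, not the paper's exact protocol.

```python
"""Matched-size bootstrap for feature-space distances: a minimal sketch.

Assumes precomputed feature matrices (one row per image). The bootstrap
parameters are illustrative assumptions, not the paper's exact settings.
"""
import numpy as np
from scipy import linalg

def frechet_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Fréchet distance between Gaussians fit to two feature matrices."""
    mu_a, mu_b = a.mean(axis=0), b.mean(axis=0)
    cov_a = np.cov(a, rowvar=False)
    cov_b = np.cov(b, rowvar=False)
    covmean = linalg.sqrtm(cov_a @ cov_b).real  # drop tiny imaginary numerics
    return float(np.sum((mu_a - mu_b) ** 2)
                 + np.trace(cov_a + cov_b - 2.0 * covmean))

def matched_size_bootstrap(real_feats: np.ndarray, synth_feats: np.ndarray,
                           n: int, n_boot: int = 20, seed: int = 0):
    """Distance on equal-size subsamples of both sets, averaged over resamples."""
    rng = np.random.default_rng(seed)
    vals = []
    for _ in range(n_boot):
        r = real_feats[rng.choice(len(real_feats), size=n, replace=False)]
        s = synth_feats[rng.choice(len(synth_feats), size=n, replace=False)]
        vals.append(frechet_distance(r, s))
    return float(np.mean(vals)), float(np.std(vals))
```

Because both sets are subsampled to the same size n on every resample, generators with differently sized outputs can be compared on equal footing, and the bootstrap standard deviation gives a rough uncertainty estimate for each metric value.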
- Synthetic augmentation boosted YOLOv11 mAP by up to 30.6% (relative) in the challenging multi-instance PottedPlant scenario
- Standard metrics like FID showed weak correlation with detection mAP after controlling for augmentation quantity (see the residualized-correlation sketch after this list)
- Study tested 6 generators across 3 datasets with augmentation ratios from 10% to 150%
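The "controlling for augmentation quantity" step amounts to a partial (residualized) correlation: regress both the generative metric and mAP on the augmentation ratio, then correlate what is left over. A minimal numpy/scipy sketch follows; the variable names and the choice of Spearman correlation are illustrative assumptions, not the paper's code.

```python
"""Residualized (partial) correlation: a minimal sketch.

Controls a metric-vs-mAP correlation for augmentation quantity by
regressing both variables on the quantity and correlating the residuals.
"""
import numpy as np
from scipy import stats

def residualize(y: np.ndarray, covariate: np.ndarray) -> np.ndarray:
    """Residuals of y after ordinary least squares on [1, covariate]."""
    X = np.column_stack([np.ones_like(covariate), covariate])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

def residualized_correlation(metric, map_scores, aug_ratio):
    """Spearman correlation of metric vs. mAP with aug_ratio partialed out."""
    ratio = np.asarray(aug_ratio, float)
    r_metric = residualize(np.asarray(metric, float), ratio)
    r_map = residualize(np.asarray(map_scores, float), ratio)
    return stats.spearmanr(r_metric, r_map)

# Usage with toy arrays: an apparently strong FID-mAP association can
# shrink once the shared dependence on augmentation ratio is removed.
# rho, p = residualized_correlation(fid_values, map_values, ratios)
```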
Why It Matters
Computer vision teams can't trust standard metrics to evaluate synthetic training data; they need task-specific validation instead.