Synthetic Data Generation for Long-Tail Medical Image Classification: A Case Study in Skin Lesions
New AI pipeline creates realistic medical images to fix class imbalance and improve diagnosis.
Long-tailed class distributions are a persistent challenge in multi-class medical datasets, where rare but critical diseases (e.g., certain skin lesions) have very few training samples. Standard deep learning models severely underperform on these tail classes, and existing fixes—specialized architectures, rebalanced losses, or handcrafted augmentation—offer only marginal gains because they cannot introduce genuinely new variability into the scarce tail-class samples.
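One of the "rebalanced loss" fixes mentioned above typically weights each class inversely to its frequency, so that errors on rare classes cost more. The sketch below is a generic illustration of that idea, not the method from this work; the class names and toy counts are invented for the example.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Per-class weights inversely proportional to class frequency,
    normalized so the weights average to 1.0 across classes."""
    counts = Counter(labels)
    n_classes = len(counts)
    raw = {c: 1.0 / n for c, n in counts.items()}
    scale = n_classes / sum(raw.values())
    return {c: w * scale for c, w in raw.items()}

# Toy long-tailed label list: the head class ("NV") dominates,
# the tail class ("DF") has very few samples.
labels = ["NV"] * 90 + ["MEL"] * 8 + ["DF"] * 2
weights = inverse_frequency_weights(labels)
```

These weights would then be passed to a weighted loss during training. The limitation the article points out still applies: reweighting changes how much the few tail samples count, but adds no new image variability.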
To overcome this, the team developed a diffusion-model-driven pipeline combining inpainting with an Out-of-Distribution (OOD) post-selection step. The inpainting diffusion model synthesizes realistic lesion images by filling in masked regions, while the OOD filter ensures only clinically meaningful and diverse samples are kept. Tested on the large, imbalanced ISIC2019 skin lesion dataset, the approach yielded substantial overall gains, including an improvement of over 28% on the class with the fewest samples, demonstrating diffusion-based augmentation as a powerful tool for medical AI robustness.
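The generate-then-filter structure of the pipeline can be sketched as a loop: propose synthetic candidates, score each for being out-of-distribution, and keep only those below a threshold. Everything here is a placeholder stand-in; a real system would invoke an inpainting diffusion model and a trained OOD detector where the stub functions appear, and the threshold value is invented for illustration.

```python
import random

# Hypothetical stand-in: a real pipeline would run an inpainting
# diffusion model here and score the result with an OOD detector.
def generate_candidate(seed):
    """Return a (sample_id, ood_score) pair; lower score means the
    sample looks closer to the real-lesion distribution."""
    rng = random.Random(seed)
    return f"synthetic_{seed}", rng.random()

def augment_tail_class(n_candidates, ood_threshold=0.5):
    """Generate candidates and keep only those the OOD post-selection
    step accepts."""
    kept = []
    for seed in range(n_candidates):
        sample, score = generate_candidate(seed)
        if score < ood_threshold:
            kept.append(sample)
    return kept

selected = augment_tail_class(100)
```

The key design point is that filtering happens after generation: the diffusion model is free to produce diverse outputs, and the OOD step discards the clinically implausible ones rather than constraining the generator itself.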
- Novel pipeline uses inpainting diffusion model + OOD post-selection to generate diverse synthetic medical images.
- Achieves over 28% improvement on the rarest skin lesion class in the ISIC2019 benchmark.
- Addresses long-tail imbalance, enabling better AI diagnostic accuracy for rare, high-risk diseases.
Why It Matters
Improves AI diagnosis of rare diseases by generating realistic synthetic data, reducing critical misdiagnoses.