Image & Video

Flow matching with optimized priors boosts rare disease image augmentation

New method re-centers generative paths to fix tail-class bias in medical imaging

Deep Dive

A new paper from Felix Nützel, Mischa Dombrowski, and Bernhard Kainz tackles a critical problem in medical AI: rare diseases are severely underrepresented in training datasets, causing classifiers to fail precisely when detection matters most. Standard generative augmentation techniques struggle because coarse disease labels aggregate diverse subtypes and acquisition settings into multi-modal conditionals, biasing generators toward dominant submodes. Meanwhile, a shared Gaussian source forces rare subpopulations through disproportionately long transport paths, degrading quality.

The authors propose Flow Matching with Optimized Subclass Priors (FMP), an offline strategy that introduces informative priors at two levels. First, they partition each coarse label into coherent submodes by fitting a Gaussian mixture model in the generative model's latent space. Second, they learn subclass-conditioned source distributions that re-center and re-scale the starting distribution for each submode, which shortens transport trajectories and reduces within-subclass dispersion. To prevent degenerate solutions, a geometric control term concentrates the normalized displacement directions around learnable prototypes and caps path-length outliers. On long-tailed chest X-ray (MIMIC-LT, NIH-LT) and CT slice (CT-RATE) benchmarks, FMP consistently improves tail-class generation fidelity (FID, IRS) as well as downstream balanced accuracy and macro-F1 across modalities, offering a practical pipeline for rare-disease augmentation.
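The effect of a per-submode source on path length can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes synthetic latents with known submode assignments (standing in for the Gaussian mixture step), uses simple moment matching in place of the learned priors, and omits the geometric control term. All function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_latents(n_per_mode=200, dim=8):
    """Synthetic latents for one coarse label with two well-separated submodes."""
    centers = np.stack([np.full(dim, 4.0), np.full(dim, -4.0)])
    z = np.concatenate(
        [c + 0.5 * rng.standard_normal((n_per_mode, dim)) for c in centers]
    )
    sub = np.repeat(np.arange(len(centers)), n_per_mode)  # submode index per sample
    return z, sub

def fit_subclass_priors(z, sub):
    """Per-submode mean and per-dimension std (moment matching, an assumption here).

    Assumes submode indices are 0..K-1 so they can index the returned arrays.
    """
    ks = np.unique(sub)
    mu = np.stack([z[sub == k].mean(axis=0) for k in ks])
    sigma = np.stack([z[sub == k].std(axis=0) for k in ks])
    return mu, sigma

def sample_source(sub, mu, sigma, shared=False):
    """Draw x0 from a shared N(0, I) or from a re-centered, re-scaled submode prior."""
    eps = rng.standard_normal(mu[sub].shape)
    return eps if shared else mu[sub] + sigma[sub] * eps

def fm_pair(x0, x1, t):
    """Linear flow-matching path: interpolant x_t and its target velocity x1 - x0."""
    t = t[:, None]
    return (1.0 - t) * x0 + t * x1, x1 - x0

z, sub = make_latents()
mu, sigma = fit_subclass_priors(z, sub)

x0_shared = sample_source(sub, mu, sigma, shared=True)
x0_sub = sample_source(sub, mu, sigma)

# Mean straight-line transport distance from source sample to data sample.
len_shared = np.linalg.norm(z - x0_shared, axis=1).mean()
len_sub = np.linalg.norm(z - x0_sub, axis=1).mean()

# Training pairs a velocity network would regress on (network itself omitted).
t = rng.uniform(size=len(z))
x_t, v_target = fm_pair(x0_sub, z, t)

print(f"shared source: {len_shared:.2f}, subclass source: {len_sub:.2f}")
```

In this toy setup, the subclass-conditioned source starts each sample near its own submode, so the average displacement the velocity field has to cover is much smaller than with a single shared Gaussian.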

Key Points
  • Partitions coarse disease labels into coherent submodes via Gaussian mixture modeling in latent space to reduce multi-modal conditioning bias
  • Learns subclass-conditioned source distributions that shorten transport paths for rare subpopulations
  • Achieves +5–15% balanced accuracy improvement on long-tailed chest X-ray and CT benchmarks

Why It Matters

FMP could make AI diagnosis of rare diseases more reliable by generating high-fidelity synthetic examples where real data is scarce.