FetalCLIP and USFM top benchmark for fetal ultrasound classification
FetalCLIP hits F1 0.9731 on out-of-domain fetal plane data
A new study on arXiv provides the first comprehensive benchmark of ultrasound-specific foundation models for fetal plane classification. The work, led by Leya Barrientos and colleagues, compared four ultrasound foundation models (USFM, MOFO, UltraSAM, FetalCLIP) with three models pretrained on natural images: two CNN baselines (ResNet50, EfficientNet-V2) and a vision transformer (DINOv3). Every model was evaluated in two settings, full fine-tuning and linear probing with a frozen encoder, using 5-fold patient-level cross-validation on a Spanish fetal ultrasound dataset. The models were also tested on an external African cohort to assess cross-population (out-of-domain) generalization.
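To make "patient-level" cross-validation concrete, here is a minimal sketch of one way to build such folds: images are grouped by patient, and whole patients (not individual images) are assigned to folds, so no patient contributes images to both the training and test side of a split. The function name `patient_level_folds` and the `patient_ids` input are hypothetical and chosen for illustration; the article does not describe the study's actual splitting code.

```python
import random
from collections import defaultdict


def patient_level_folds(patient_ids, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs where no patient spans both sides.

    patient_ids[i] is the (hypothetical) patient identifier of image i.
    """
    # Group image indices by the patient they belong to.
    by_patient = defaultdict(list)
    for idx, pid in enumerate(patient_ids):
        by_patient[pid].append(idx)

    # Shuffle patients reproducibly, then deal them round-robin into folds.
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)
    folds = [patients[i::n_folds] for i in range(n_folds)]

    for held_out in folds:
        held_out_set = set(held_out)
        test_idx = sorted(i for p in held_out for i in by_patient[p])
        train_idx = sorted(
            i for p in patients if p not in held_out_set for i in by_patient[p]
        )
        yield train_idx, test_idx


# Toy example: 10 patients with 3 images each.
patient_ids = [f"patient_{i // 3}" for i in range(30)]
splits = list(patient_level_folds(patient_ids, n_folds=5))
```

Splitting by image instead would let near-identical frames from the same examination land on both sides of a split, which tends to inflate test scores.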
FetalCLIP achieved the highest F1 scores in the low-data linear probing regime: 0.9261 in-domain and 0.9731 out-of-domain. USFM performed best with full fine-tuning, scoring 0.9476 in-domain and 0.9515 out-of-domain. In contrast, MOFO and UltraSAM performed markedly worse in both settings, sometimes falling below DINOv3, which was pretrained only on natural images. The findings indicate that pretraining objective and data composition heavily influence transferability, with FetalCLIP's contrastive objective proving especially robust for cross-population generalization. The benchmark offers practical guidance for deploying AI in fetal ultrasound, particularly in low-resource and diverse clinical settings where annotated data is scarce.
- FetalCLIP achieved top linear probing F1 of 0.9261 in-domain and 0.9731 out-of-domain.
- USFM led full fine-tuning with F1 of 0.9476 in-domain and 0.9515 out-of-domain.
- MOFO and UltraSAM underperformed, sometimes worse than natural-image pretrained DINOv3.
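For readers unfamiliar with the metric, the sketch below shows how a multi-class F1 score can be computed from predictions. It assumes macro averaging (an unweighted mean of per-class F1), which the article does not confirm; the function name `macro_f1` and the plane labels are illustrative only and are not taken from the study.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro averaging, assumed here)."""
    classes = sorted(set(y_true) | set(y_pred))
    per_class = []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == c and p == c for t, p in pairs)  # correctly predicted as c
        fp = sum(t != c and p == c for t, p in pairs)  # wrongly predicted as c
        fn = sum(t == c and p != c for t, p in pairs)  # missed instances of c
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


# Illustrative labels only: four images, one brain view misread as femur.
y_true = ["brain", "brain", "femur", "abdomen"]
y_pred = ["brain", "femur", "femur", "abdomen"]
score = macro_f1(y_true, y_pred)
```

In a benchmark like this one, the same function would be applied once to the held-out folds of the in-domain dataset and once to the external cohort to obtain the two figures reported per model.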
Why It Matters
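The results suggest that a well-chosen foundation model such as FetalCLIP can classify fetal planes accurately across populations using only a linear probe on a frozen encoder, which matters most where labeled data is scarce.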