Quality assessment of brain structural MR images: Comparing generalization of deep learning versus hand-crafted feature-based machine learning methods to new sites
Research on 1,098 brain scans from 17 sites reveals critical generalization challenges for AI in medical imaging.
A new study from researchers including Prabhjot Kaur, John S. Thornton, and Hui Zhang tackles a critical bottleneck in large-scale neuroimaging: automated quality assessment of brain structural MR images. The team compared two prominent approaches on a heterogeneous dataset of 1,098 T1-weighted volumes from 17 different sites: MRIQC, which pairs hand-crafted image-quality metrics with a traditional machine-learning classifier, and CNNQC, a convolutional neural network that learns quality-related features directly from the images. Using a rigorous leave-one-site-out (LOSO) validation scheme, they tested how well each method generalized to entirely new scanners and imaging protocols.
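The LOSO scheme can be sketched in a few lines: for each of the 17 sites, train on all other sites and test on the held-out one. This is an illustrative implementation, not the study's code; the function name `loso_splits` and the toy site labels are assumptions for the example.

```python
def loso_splits(site_labels):
    """Yield (held_out_site, train_indices, test_indices), one fold per site.

    Each fold holds out every scan from one site for testing and trains
    on scans from all remaining sites, so test data always comes from a
    scanner/protocol the model never saw during training.
    """
    for held_out in sorted(set(site_labels)):
        test = [i for i, s in enumerate(site_labels) if s == held_out]
        train = [i for i, s in enumerate(site_labels) if s != held_out]
        yield held_out, train, test

# Toy example: 6 scans from 3 sites (the real study used 17 sites).
sites = ["A", "A", "B", "B", "C", "C"]
for held_out, train, test in loso_splits(sites):
    print(f"hold out {held_out}: train on {train}, test on {test}")
```

One fold per site means every scan is tested exactly once, and always by a model that has never seen its site, which is what makes LOSO a direct probe of cross-site generalization.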
The results reveal significant challenges for AI deployment in medical imaging. Both deep learning and traditional machine learning methods struggled to maintain performance when applied to data from unseen sites, highlighting the persistent problem of domain shift in healthcare AI. While MRIQC generally achieved higher accuracy across most unseen sites, CNNQC demonstrated superior sensitivity for detecting poor-quality scans—a crucial metric for clinical applications where missing a flawed scan could bias research findings or clinical estimates.
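Sensitivity here is the true-positive rate for the "poor quality" class: of all truly flawed scans, what fraction does the method flag? A minimal sketch, with hypothetical labels (1 = poor quality, 0 = usable):

```python
def sensitivity(y_true, y_pred):
    """True-positive rate: fraction of truly poor scans flagged as poor."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # caught
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # missed
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical QC labels for 8 scans: 4 are truly poor, the method catches 3.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
print(sensitivity(y_true, y_pred))  # 0.75
```

Note that sensitivity ignores false positives (usable scans flagged as poor), which is why a method can lead on sensitivity while trailing on overall accuracy, exactly the trade-off the study reports between CNNQC and MRIQC.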
Despite the generalization challenges, the study suggests CNNQC's deep learning approach may still be preferable for widespread deployment due to its higher computational efficiency and elimination of expensive pre-processing steps. The research underscores that future work must focus specifically on improving cross-site generalizability through techniques like domain adaptation or federated learning. As large-scale neuroimaging studies increasingly rely on multi-site data, developing robust quality assessment tools that work consistently across different scanners and protocols becomes essential for advancing neuroscience research and clinical applications.
- Tested 1,098 brain MRI scans from 17 different sites using leave-one-site-out validation
- Both MRIQC (traditional ML) and CNNQC (deep learning) struggled to generalize to new scanners
- CNNQC showed higher sensitivity for detecting poor-quality scans and better computational efficiency
Why It Matters
Highlights critical barriers to deploying AI in real-world medical imaging where scanner variability is the norm, not the exception.