Hybrid transformers beat CNNs and vision-language models in retinal screening study
Twelve architectures tested on 28 retinal diseases — attention models win.
A new study systematically benchmarks 12 deep learning architectures across four model families—CNNs, vision transformers, hybrid CNN-transformers, and vision-language models—for multi-disease retinal screening. Using the Retinal Fundus Multi-disease Image Dataset (RFMiD) covering 28 disease classes, the researchers evaluated binary screening (any disease vs. normal) and multi-label classification. All models achieved AUC above 84% for binary screening, but attention-based models—specifically SwinTiny, CoAtNet0, and MaxViTTiny—delivered the highest performance. Hybrid CNN-transformer backbones also excelled in macro and micro F1 scores for multi-label classification. Vision-language models such as CLIP ViT-B/16 and SigLIP-Base384 were competitive with CNN baselines but fell short of the top transformers and hybrids.
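To make the two evaluation settings concrete, here is a minimal sketch of how a multi-label ground truth can be collapsed into the binary "any disease vs. normal" screening label, and how macro F1, micro F1, and binary AUC differ. It is not the authors' code: the function names (`any_disease`, `f1_scores`, `binary_auc`) and the toy labels are hypothetical, and it assumes a class with no true or predicted positives scores an F1 of 0.

```python
from typing import List, Sequence, Tuple


def any_disease(labels: Sequence[Sequence[int]]) -> List[int]:
    """Collapse multi-label rows into a screening label (1 = any disease present)."""
    return [int(any(row)) for row in labels]


def _f1(tp: int, fp: int, fn: int) -> float:
    # Assumed convention: F1 is 0 when there are no positives at all.
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_scores(y_true: Sequence[Sequence[int]],
              y_pred: Sequence[Sequence[int]]) -> Tuple[float, float]:
    """Return (macro F1, micro F1) for 0/1 multi-label predictions."""
    n_classes = len(y_true[0])
    per_class = []
    tot_tp = tot_fp = tot_fn = 0
    for c in range(n_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[c] == 1 and p[c] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[c] == 0 and p[c] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[c] == 1 and p[c] == 0)
        per_class.append(_f1(tp, fp, fn))
        tot_tp, tot_fp, tot_fn = tot_tp + tp, tot_fp + fp, tot_fn + fn
    macro = sum(per_class) / n_classes      # every class weighs the same
    micro = _f1(tot_tp, tot_fp, tot_fn)     # every label decision weighs the same
    return macro, micro


def binary_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: 4 images, 3 disease classes (made up for illustration only).
y_true = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[0, 0, 0], [1, 0, 0], [1, 0, 0], [0, 0, 0]]
screen_scores = [0.2, 0.9, 0.8, 0.1]  # hypothetical "any disease" probabilities

macro_f1, micro_f1 = f1_scores(y_true, y_pred)
auc = binary_auc(any_disease(y_true), screen_scores)
print(f"macro F1={macro_f1:.3f}  micro F1={micro_f1:.3f}  AUC={auc:.3f}")
```

In this toy case the model only ever detects the most common class, so micro F1 (about 0.67) looks much better than macro F1 (about 0.33). That gap is why reporting both matters when a dataset mixes common and rare conditions.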
External validation on the Messidor-2 dataset for referable diabetic retinopathy confirmed the trend: hybrid and transformer models again led, with AUC ranging from 66.8% to 84.7%. The authors emphasize that standardized training and calibration protocols make these results reproducible for real-world deployment. The findings offer clear guidance for selecting AI architectures in automated retinal screening tools, especially in multi-disease settings where accuracy across many conditions is critical. This work, accepted at ICMHI 2026, moves the field closer to clinically viable AI that can handle both common and rare retinal diseases under domain shift.
- SwinTiny, CoAtNet0, and MaxViTTiny achieved the best binary and multi-label performance on the RFMiD dataset (28 disease classes).
- All 12 models had AUC above 84% for binary screening; vision-language models (CLIP, SigLIP) were competitive with CNN baselines but trailed the top transformers and hybrids.
- External validation on Messidor-2 showed AUC up to 84.7% for hybrid/transformer models, confirming robustness under domain shift.
Why It Matters
Helps guide the selection of AI architectures for clinical retinal screening tools that must stay accurate across many diseases at once, here 28 disease classes.