Are Multimodal LLMs Ready for Clinical Dermatology? A Real-World Evaluation
Top-3 diagnostic accuracy falls from 42% on benchmarks to 24% on clinical images.
A new study from researchers at Yale and other institutions systematically evaluates how well multimodal large language models (MLLMs) perform on real-world dermatology tasks compared to standard public benchmarks. They tested four open-weight models (InternVL-Chat v1.5, LLaVA-Med v1.5, SkinGPT4, and MedGemma-4B-Instruct) and one commercial model (GPT-4.1) on two clinically relevant tasks: differential diagnosis generation and severity-based triage. The evaluation used three public dermatology datasets plus a large retrospective multi-site hospital consultation cohort comprising 5,811 cases and 46,405 clinical images.
On public benchmarks, GPT-4.1's top-3 diagnostic accuracy reached 42.25%, while the best open-weight model scored 26.55%. When tested on real-world consultation images alone, however, GPT-4.1's accuracy dropped to 24.65%, and open-weight models fell to between 1.5% and 13.35%. Incorporating clinical context (e.g., patient history, lesion description) improved performance, lifting GPT-4.1 to 38.93% and the best open-weight model to 28.75%. Notably, model outputs were highly sensitive to incomplete or erroneous consultation context. For severity triage, models achieved moderate sensitivity (above 60%), suggesting potential as a screening aid but insufficient reliability for clinical deployment. The authors conclude that benchmark performance substantially overestimates real-world clinical capability.
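The headline metric throughout is top-3 accuracy: a case counts as correct if the true diagnosis appears anywhere in the model's three-item differential. A minimal sketch of how such a figure is computed (illustrative function and diagnosis names, not the study's actual code or data):

```python
def top_k_accuracy(predictions, labels, k=3):
    """Fraction of cases where the true diagnosis appears in the
    model's top-k ranked differential.

    predictions: list of ranked differential-diagnosis lists
    labels: list of ground-truth diagnoses (one per case)
    """
    hits = sum(1 for preds, label in zip(predictions, labels)
               if label in preds[:k])
    return hits / len(labels)

# Toy example: the correct diagnosis is in the top 3 for 2 of 3 cases.
preds = [["eczema", "psoriasis", "tinea corporis"],
         ["melanoma", "dysplastic nevus", "seborrheic keratosis"],
         ["acne", "rosacea", "folliculitis"]]
truth = ["psoriasis", "basal cell carcinoma", "acne"]
print(top_k_accuracy(preds, truth, k=3))  # → 0.6666666666666666
```

Reporting top-3 rather than top-1 reflects how differentials are used clinically: the model's list is a decision aid, and a correct diagnosis ranked second or third still narrows the workup.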
- GPT-4.1 achieved 42.25% top-3 accuracy on public benchmarks but only 24.65% on real-world clinical images alone.
- Open-weight models (e.g., InternVL-Chat v1.5) scored as low as 1.5% top-3 accuracy on real-world cases without clinical context.
- Adding clinical context improved GPT-4.1 to 38.93% but made outputs highly sensitive to input errors; triage sensitivity was >60% but not deployment-ready.
Why It Matters
Exposes the critical gap between benchmark hype and real-world clinical reliability for AI in dermatology.