Are Multimodal LLMs Ready for Clinical Dermatology? A Real-World Evaluation
Top-3 diagnostic accuracy falls from 42% on benchmarks to 24% on clinical images.
A new study from researchers at Yale and other institutions systematically evaluates how well multimodal large language models (MLLMs) perform on real-world dermatology tasks compared to standard public benchmarks. They tested four open-weight models (InternVL-Chat v1.5, LLaVA-Med v1.5, SkinGPT4, and MedGemma-4B-Instruct) and one commercial model (GPT-4.1) on two clinically relevant tasks: differential diagnosis generation and severity-based triage. The evaluation used three public dermatology datasets plus a large retrospective multi-site hospital consultation cohort comprising 5,811 cases and 46,405 clinical images.
On public benchmarks, GPT-4.1's top-3 diagnostic accuracy reached 42.25%, while the best open-weight model scored 26.55%. When tested on real-world consultation images alone, however, GPT-4.1's accuracy dropped to 24.65%, and open-weight models fell to between 1.5% and 13.35%. Incorporating clinical context (e.g., patient history, lesion description) improved performance, lifting GPT-4.1 to 38.93% and the best open-weight model to 28.75%. Notably, model outputs were highly sensitive to incomplete or erroneous consultation context. For severity triage, models achieved moderate sensitivity (above 60%), suggesting potential as a screening aid but insufficient reliability for clinical deployment. The authors conclude that benchmark performance substantially overestimates real-world clinical capability.
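The headline metric throughout is top-3 accuracy: a case counts as correct if the true diagnosis appears anywhere in the model's three-item differential. A minimal sketch of how such a figure is computed (illustrative function and diagnosis names, not the study's actual code or data):

```python
def top_k_accuracy(predictions, labels, k=3):
    """Fraction of cases where the true diagnosis appears in the
    model's top-k ranked differential.

    predictions: list of ranked differential-diagnosis lists
    labels: list of ground-truth diagnoses (one per case)
    """
    hits = sum(1 for preds, label in zip(predictions, labels)
               if label in preds[:k])
    return hits / len(labels)

# Toy example: the correct diagnosis is in the top 3 for 2 of 3 cases.
preds = [["eczema", "psoriasis", "tinea corporis"],
         ["melanoma", "dysplastic nevus", "seborrheic keratosis"],
         ["acne", "rosacea", "folliculitis"]]
truth = ["psoriasis", "basal cell carcinoma", "acne"]
print(top_k_accuracy(preds, truth, k=3))  # → 0.6666666666666666
```

Reporting top-3 rather than top-1 reflects how differentials are used clinically: the model's list is a decision aid, and a correct diagnosis ranked second or third still narrows the workup.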
- GPT-4.1 achieved 42.25% top-3 accuracy on public benchmarks but only 24.65% on real-world clinical images alone.
- Open-weight models (e.g., InternVL-Chat v1.5) scored as low as 1.5% top-3 accuracy on real-world cases without clinical context.
- Adding clinical context improved GPT-4.1 to 38.93% but made outputs highly sensitive to input errors; triage sensitivity was >60% but not deployment-ready.
Why It Matters
Exposes the critical gap between benchmark hype and real-world clinical reliability for AI in dermatology.