MedFM-Robust benchmark tests reliability of medical AI models
New benchmark evaluates GPT-4o, LLaVA-Med, and SAM variants under real-world clinical conditions.
A team led by Xiangxiang Cui has released MedFM-Robust, a standardized benchmark designed to stress-test the robustness of medical foundation models (MedFMs) before they reach clinical settings. The benchmark targets two major categories of MedFMs: medical vision-language models (Med-VLMs) and segmentation foundation models. The Med-VLMs evaluated include specialized models such as LLaVA-Med and MedGemma, as well as general-purpose models such as GPT-4o and Gemini. All are tested on visual question answering (VQA), report generation, and visual grounding. On the segmentation side, the benchmark assesses SAM-based adaptations such as SAM-Med2D and MedSAM.
The work, accepted at MICCAI 2026, introduces robustness tests that simulate real-world conditions: noisy inputs, domain shifts across hospitals, and adversarial perturbations. By quantifying how these models degrade under stress, MedFM-Robust shows clinicians and developers where the models are likely to fail. Early results reveal significant performance drops in even top-tier Med-VLMs under domain shift, while SAM-based segmenters show surprising resilience to noise but vulnerability to adversarial attacks. The benchmark is open-source and available on arXiv, with the aim of driving safer deployment of AI in healthcare.
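To make "quantifying degradation" concrete, here is a minimal sketch of one way such a measurement could be set up. It is not MedFM-Robust's actual protocol or code: the Dice score, the Gaussian noise model, the `threshold_segmenter` stand-in for a real model, and the `relative_drop` formula are all illustrative assumptions.

```python
import random

def dice(pred, target):
    """Dice coefficient between two equal-length binary masks."""
    inter = sum(p and t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 1.0 if total == 0 else 2.0 * inter / total

def add_gaussian_noise(pixels, sigma, rng):
    """Copy of the pixels with zero-mean Gaussian noise, clipped to [0, 1]."""
    return [min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in pixels]

def threshold_segmenter(pixels, cutoff=0.5):
    """Hypothetical stand-in for a segmentation model: foreground if > cutoff."""
    return [1 if p > cutoff else 0 for p in pixels]

def relative_drop(clean_score, perturbed_score):
    """Fraction of clean performance lost under a perturbation (assumed metric)."""
    if clean_score == 0:
        return 0.0
    return (clean_score - perturbed_score) / clean_score

rng = random.Random(0)  # fixed seed so runs are repeatable

# Synthetic 1-D "image": bright foreground (0.8) on a dark background (0.2).
target = [1] * 50 + [0] * 50
image = [0.8 if t else 0.2 for t in target]

clean = dice(threshold_segmenter(image), target)
for sigma in (0.0, 0.1, 0.3):
    noisy = add_gaussian_noise(image, sigma, rng)
    score = dice(threshold_segmenter(noisy), target)
    print(f"sigma={sigma:.1f} dice={score:.3f} drop={relative_drop(clean, score):.3f}")
```

Reporting the drop relative to the clean score, rather than the raw score alone, makes models with different baseline accuracy easier to compare under the same perturbation.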
- Evaluates Med-VLMs (LLaVA-Med, MedGemma, GPT-4o, Gemini) on VQA, report generation, and visual grounding.
- Tests segmentation models SAM-Med2D and MedSAM under noise, domain shift, and adversarial inputs.
- Accepted at MICCAI 2026; open-source benchmark to standardize clinical robustness evaluation.
Why It Matters
Before deploying medical AI, we need to know how it fails. This benchmark provides that critical safety check.