Research & Papers

MedOpenClaw: Auditable Medical Imaging Agents Reasoning over Uncurated Full Studies

New benchmark shows GPT-5.4 and Gemini 3.1 Pro perform worse when given medical tools, a failure the authors trace to weak spatial grounding.

Deep Dive

A research team from Imperial College London and Tsinghua University has published a groundbreaking paper introducing MedOpenClaw, a new framework designed to evaluate AI agents in realistic medical imaging workflows. The system is an auditable runtime that allows vision-language models (VLMs) to operate dynamically within standard medical software like 3D Slicer, moving beyond the oversimplified, static 2D image evaluations common today. To test these agents, the team created MedFlowBench, a comprehensive benchmark that uses full, uncurated 3D medical studies—including multi-sequence brain MRI and lung CT/PET scans—to simulate the actual task of navigating volumes and gathering evidence for a diagnosis.
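The core idea of an "auditable runtime" is that every action the agent takes against the study is recorded before it executes, so a reviewer can replay how the evidence was gathered. The paper does not publish MedOpenClaw's API, so the sketch below is purely illustrative: the `AuditableRuntime` class and tool names like `set_slice` and `window_level` are hypothetical stand-ins, not the framework's actual interface.

```python
import json
from dataclasses import dataclass, field

@dataclass
class AuditableRuntime:
    """Hypothetical sketch: log each tool call before dispatching it,
    so the full interaction trace can be audited or replayed later.
    Tool names are illustrative, not MedOpenClaw's real API."""
    log: list = field(default_factory=list)

    def call(self, tool: str, **args):
        # Record intent first, then act, so even failed calls are traceable.
        self.log.append({"tool": tool, "args": args})
        return self._dispatch(tool, args)

    def _dispatch(self, tool: str, args: dict):
        # Stand-in for real viewer actions (e.g. 3D Slicer bindings).
        if tool == "set_slice":
            return f"viewing axial slice {args['index']}"
        if tool == "window_level":
            return f"window={args['window']} level={args['level']}"
        raise ValueError(f"unknown tool: {tool}")

rt = AuditableRuntime()
rt.call("set_slice", index=42)
rt.call("window_level", window=400, level=40)
print(json.dumps(rt.log, indent=2))  # replayable audit trail
```

The design choice worth noting is that the log captures the agent's *intent* (tool plus arguments), not just the result, which is what makes the trace reproducible rather than merely descriptive.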

The initial results from MedFlowBench revealed a surprising and critical insight. While state-of-the-art models like OpenAI's GPT-5.4 and Google's Gemini 3.1 Pro could successfully navigate a basic image viewer to solve study-level tasks, their performance paradoxically degraded when they were given access to professional support tools. The research indicates this failure stems from a fundamental lack of precise spatial grounding; the models cannot effectively correlate abstract tool functions with specific locations in a 3D volume. This finding exposes a major gap between current AI capabilities and the needs of interactive clinical diagnostics.

By bridging the gap between static perception and dynamic workflow, MedOpenClaw and MedFlowBench establish the first reproducible foundation for developing auditable, full-study medical imaging agents. This work shifts the focus from simple image recognition to evaluating an AI's ability to reason and act within a complex, tool-enabled environment—a necessary step toward creating assistants that can genuinely augment radiologists rather than just answer questions about preselected slices.

Key Points
  • MedOpenClaw is an auditable runtime letting AI agents operate in tools like 3D Slicer, moving beyond static 2D image analysis.
  • The companion MedFlowBench tests agents on full 3D studies (brain MRI, lung CT/PET), revealing a key flaw in top models like GPT-5.4.
  • Performance of Gemini 3.1 Pro and GPT-5.4 degrades with tool access due to poor spatial grounding, highlighting a critical research gap.

Why It Matters

This exposes a fundamental weakness in current medical AI: models can't effectively use professional tools in 3D space, delaying real clinical integration.