Skill-Evolving Grounded Reasoning for Free-Text Promptable 3D Medical Image Segmentation
The new 'SEER' framework cuts performance variance by 81.94% when doctors phrase the same anatomical request differently.
A research team led by Tongrui Zhang and Chenhui Wang has introduced SEER (Skill-Evolving grounded Reasoning), a breakthrough framework designed to solve a major problem in AI-assisted medical imaging. Current 'promptable' segmentation models, which allow doctors to highlight anatomy using free-text commands like "segment the left ventricle," are notoriously brittle. Minor changes in phrasing—say, "outline the main heart chamber"—can cause catastrophic performance drops, even though the clinical intent is identical. SEER directly addresses this by enforcing semantic consistency through a multi-step reasoning process before any pixel-level analysis begins.
The core innovation is a two-part system. First, SEER constructs an 'evidence-aligned target representation' by running a vision-language reasoning chain. This chain explicitly verifies the doctor's clinical request against anatomical evidence extracted from the 3D scan, ensuring the instruction is grounded in visual reality. Second, the framework features 'SEER-Loop,' a dynamic skill-evolving strategy. SEER-Loop distills successful reasoning paths from complex cases into reusable 'skill artifacts,' which are then integrated back into the model to improve its handling of diverse future expressions. This creates a self-refining system that gets more robust over time.
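The two-part flow described above can be sketched in code. This is an illustrative toy, not the authors' implementation: every class and function name here (`SkillLibrary`, `ground_request`, `segment`, the paraphrase table) is a hypothetical stand-in for the vision-language reasoning chain and the SEER-Loop distillation step.

```python
from dataclasses import dataclass, field


@dataclass
class SkillLibrary:
    # Maps a canonical target concept to a previously successful
    # reasoning path (a "skill artifact") that can be reused later.
    artifacts: dict = field(default_factory=dict)

    def lookup(self, concept: str):
        return self.artifacts.get(concept)

    def distill(self, concept: str, reasoning_path: list):
        # SEER-Loop-style step: store the reasoning that worked so
        # future paraphrases of the same target resolve consistently.
        self.artifacts[concept] = reasoning_path


def ground_request(request: str, anatomy_evidence: set):
    """Toy stand-in for the reasoning chain: map a free-text request
    to a canonical anatomical concept, then verify that concept against
    evidence extracted from the 3D scan."""
    synonyms = {  # hypothetical paraphrase table
        "segment the left ventricle": "left_ventricle",
        "outline the main heart chamber": "left_ventricle",
    }
    concept = synonyms.get(request.lower())
    steps = [f"request -> {concept}", f"verify '{concept}' against scan evidence"]
    grounded = concept is not None and concept in anatomy_evidence
    return concept, steps, grounded


def segment(request: str, anatomy_evidence: set, skills: SkillLibrary) -> str:
    concept, steps, grounded = ground_request(request, anatomy_evidence)
    if not grounded:
        return "abstain"  # instruction is not supported by the image
    if skills.lookup(concept) is None:
        skills.distill(concept, steps)  # evolve: keep the successful path
    return concept  # a real system would now produce the pixel-level mask


skills = SkillLibrary()
evidence = {"left_ventricle", "myocardium"}
# Two differently worded requests resolve to the same grounded target:
a = segment("Segment the left ventricle", evidence, skills)
b = segment("Outline the main heart chamber", evidence, skills)
```

The key property the sketch mimics is that grounding happens before any segmentation: both paraphrases collapse to one verified concept, so the downstream mask cannot diverge just because the wording did.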
The results, validated on the team's new SEER-Trace benchmark dataset, are significant. Under controlled linguistic perturbations, SEER reduced performance variance by 81.94% compared to state-of-the-art baselines, meaning its output is far more reliable regardless of how a request is worded. It also improved the worst-case segmentation accuracy (measured by Dice score) by 18.60%. This represents a major leap toward clinically dependable AI, where tools must understand intent, not just keywords, to be safely integrated into diagnostic workflows.
- Reduces performance variance by 81.94% under linguistic perturbations, making AI output consistent regardless of phrasing.
- Improves worst-case segmentation accuracy (Dice score) by 18.60% by using reasoning chains to verify intent against anatomy.
- Introduces a self-improving 'SEER-Loop' that distills successful case reasoning into reusable skills for future inference.
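For readers wondering what "reduces variance by 81.94%" and "improves worst-case Dice by 18.60%" mean operationally, the sketch below shows the style of computation under an assumption about the protocol: score each paraphrase of one request, then compare the variance and the minimum Dice across models. The score values are made up for illustration and do not come from the SEER-Trace benchmark.

```python
from statistics import pvariance

# Hypothetical Dice scores for four paraphrases of the same request.
baseline_dice = [0.91, 0.62, 0.88, 0.55]  # brittle: swings with phrasing
robust_dice   = [0.90, 0.87, 0.89, 0.86]  # consistent across paraphrases

# Variance reduction: how much the spread across paraphrases shrinks.
var_reduction_pct = (1 - pvariance(robust_dice) / pvariance(baseline_dice)) * 100

# Worst-case gain: improvement in the minimum (worst paraphrase) Dice.
worst_case_gain_pts = (min(robust_dice) - min(baseline_dice)) * 100
```

With these toy numbers the variance drops by roughly 99% and the worst-case Dice rises by 31 points; the reported 81.94% and 18.60% are the analogous quantities measured on the real benchmark.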
Why It Matters
Moves AI from brittle keyword matching to understanding clinical intent, a prerequisite for reliable use in real hospitals.