Research & Papers

ESICA: A Scalable Framework for Text-Guided 3D Medical Image Segmentation

Outperforms existing methods across CT, MRI, PET, ultrasound, and microscopy.

Deep Dive

Researchers propose ESICA, a scalable text-guided framework for 3D medical image segmentation that lets clinicians specify regions of interest using natural language instead of predefined labels. The framework introduces three key innovations: a similarity matrix-based mask prediction formulation that enhances semantic alignment between text and image features, an efficient decomposed decoder with adapter modules for accurate volumetric decoding, and a two-pass refinement strategy that sharpens boundaries and resolves uncertain regions. ESICA adopts a two-stage training scheme—positive-only pretraining followed by balanced fine-tuning—to improve stability and generalization.
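The core idea of similarity-based mask prediction can be sketched as follows: score every voxel by its cosine similarity to the text embedding and squash the scores into a soft mask. This is a minimal illustrative sketch; the function name, shapes, and temperature are assumptions, not ESICA's actual implementation.

```python
import numpy as np

def similarity_mask(text_emb, voxel_feats, temperature=0.07):
    """Predict a soft 3D mask from one text embedding and per-voxel features.

    text_emb:    (D,) embedding of the text prompt (hypothetical shape).
    voxel_feats: (D, X, Y, Z) per-voxel image features.
    Returns an (X, Y, Z) array of mask probabilities in [0, 1].
    """
    # L2-normalize both sides so the dot product is a cosine similarity.
    t = text_emb / np.linalg.norm(text_emb)
    v = voxel_feats / np.linalg.norm(voxel_feats, axis=0, keepdims=True)
    # Similarity of the prompt with every voxel, sharpened by a temperature.
    logits = np.einsum("d,dxyz->xyz", t, v) / temperature
    return 1.0 / (1.0 + np.exp(-logits))  # sigmoid -> per-voxel probability

rng = np.random.default_rng(0)
mask = similarity_mask(rng.standard_normal(16), rng.standard_normal((16, 8, 8, 8)))
print(mask.shape)  # (8, 8, 8)
```

Thresholding the resulting probabilities would yield a binary segmentation; in ESICA, a two-pass refinement stage further sharpens boundaries and resolves uncertain regions.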

On the CVPR BiomedSegFM benchmark, which spans five imaging modalities (CT, MRI, PET, ultrasound, and microscopy), ESICA achieves state-of-the-art segmentation accuracy. Notably, the compact ESICA4 Lite variant attains similar performance with substantially fewer parameters, yielding a superior efficiency-accuracy trade-off. This work advances text-guided segmentation toward efficient, scalable, and clinically deployable systems. Code will be made publicly available.

Key Points
  • ESICA uses a similarity matrix for mask prediction to better align text and image features.
  • Achieves state-of-the-art accuracy on the CVPR BiomedSegFM benchmark across five modalities.
  • Compact ESICA4 Lite variant delivers comparable accuracy with far fewer parameters.

Why It Matters

Enables clinicians to segment any anatomical region by typing, reducing reliance on rigid label sets.