Research & Papers

ESICA: A Scalable Framework for Text-Guided 3D Medical Image Segmentation

Outperforms existing methods across CT, MRI, PET, ultrasound, and microscopy.

Deep Dive

Researchers propose ESICA, a scalable text-guided framework for 3D medical image segmentation that lets clinicians specify regions of interest using natural language instead of predefined labels. The framework introduces three key innovations: a similarity matrix-based mask prediction formulation that enhances semantic alignment between text and image features, an efficient decomposed decoder with adapter modules for accurate volumetric decoding, and a two-pass refinement strategy that sharpens boundaries and resolves uncertain regions. ESICA adopts a two-stage training scheme—positive-only pretraining followed by balanced fine-tuning—to improve stability and generalization.
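The core idea of similarity-based mask prediction can be sketched as follows: score every voxel by its cosine similarity to the text embedding and squash the scores into a soft mask. This is a minimal illustrative sketch; the function name, shapes, and temperature are assumptions, not ESICA's actual implementation.

```python
import numpy as np

def similarity_mask(text_emb, voxel_feats, temperature=0.07):
    """Predict a soft 3D mask from one text embedding and per-voxel features.

    text_emb:    (D,) embedding of the text prompt (hypothetical shape).
    voxel_feats: (D, X, Y, Z) per-voxel image features.
    Returns an (X, Y, Z) array of mask probabilities in [0, 1].
    """
    # L2-normalize both sides so the dot product is a cosine similarity.
    t = text_emb / np.linalg.norm(text_emb)
    v = voxel_feats / np.linalg.norm(voxel_feats, axis=0, keepdims=True)
    # Similarity of the prompt with every voxel, sharpened by a temperature.
    logits = np.einsum("d,dxyz->xyz", t, v) / temperature
    return 1.0 / (1.0 + np.exp(-logits))  # sigmoid -> per-voxel probability

rng = np.random.default_rng(0)
mask = similarity_mask(rng.standard_normal(16), rng.standard_normal((16, 8, 8, 8)))
print(mask.shape)  # (8, 8, 8)
```

Thresholding the resulting probabilities would yield a binary segmentation; in ESICA, a two-pass refinement stage further sharpens boundaries and resolves uncertain regions.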

On the CVPR BiomedSegFM benchmark, which spans five imaging modalities (CT, MRI, PET, ultrasound, and microscopy), ESICA achieves state-of-the-art segmentation accuracy. Notably, the compact ESICA4 Lite variant attains similar performance with substantially fewer parameters, yielding a superior efficiency-accuracy trade-off. This work advances text-guided segmentation toward efficient, scalable, and clinically deployable systems. Code will be made publicly available.

Key Points
  • ESICA uses a similarity matrix for mask prediction to better align text and image features.
  • Achieves state-of-the-art accuracy on the CVPR BiomedSegFM benchmark across five modalities.
  • Compact ESICA4 Lite variant delivers comparable accuracy with far fewer parameters.

Why It Matters

Enables clinicians to segment any anatomical region by typing, reducing reliance on rigid label sets.