SSP-SAM: SAM with Semantic-Spatial Prompt for Referring Expression Segmentation
New framework adds semantic-spatial prompts to SAM, enabling precise object selection via natural language commands.
A research team has introduced SSP-SAM, a framework that extends Meta's Segment Anything Model (SAM) from purely visual segmentation to natural-language understanding. The core innovation is a Semantic-Spatial Prompt (SSP) encoder built from lightweight visual and linguistic "attention adapters." Working in tandem, the adapters highlight the most relevant regions in an image and the most discriminative words in a text query, fusing them into a precise, language-guided prompt for SAM. A user can simply describe a target, such as "the dog chasing the frisbee," and receive an accurate pixel-level mask, the task known as Referring Expression Segmentation (RES).
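To make the adapter idea concrete, here is a minimal PyTorch sketch of a semantic-spatial prompt encoder: a visual adapter attends over image features conditioned on the text, a linguistic adapter attends over word features conditioned on the image, and learned query tokens fuse both into prompt embeddings for SAM. The module names, dimensions, and fusion scheme are illustrative assumptions, not the published SSP-SAM implementation.

```python
# Minimal sketch of a semantic-spatial prompt encoder.
# Module names, dimensions, and the fusion scheme are assumptions for
# illustration; they are not the authors' actual SSP-SAM code.
import torch
import torch.nn as nn


class AttentionAdapter(nn.Module):
    """Lightweight cross-attention adapter: re-weights `x` by relevance to `context`."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        attended, _ = self.attn(query=x, key=context, value=context)
        return self.norm(x + attended)  # residual connection keeps the adapter lightweight


class SemanticSpatialPromptEncoder(nn.Module):
    """Fuses image and text features into prompt tokens for a frozen SAM decoder."""

    def __init__(self, dim: int = 256, num_prompt_tokens: int = 4):
        super().__init__()
        self.visual_adapter = AttentionAdapter(dim)       # highlights relevant image regions
        self.linguistic_adapter = AttentionAdapter(dim)   # highlights discriminative words
        self.prompt_queries = nn.Parameter(torch.randn(num_prompt_tokens, dim))
        self.fuse = AttentionAdapter(dim)

    def forward(self, image_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
        # image_feats: (B, HW, D) patch features; text_feats: (B, L, D) word features
        vis = self.visual_adapter(image_feats, text_feats)        # language-conditioned image features
        lang = self.linguistic_adapter(text_feats, image_feats)   # image-conditioned word features
        queries = self.prompt_queries.unsqueeze(0).expand(image_feats.size(0), -1, -1)
        prompts = self.fuse(queries, torch.cat([vis, lang], dim=1))
        return prompts  # (B, num_prompt_tokens, D), used in place of SAM's point/box prompts


if __name__ == "__main__":
    encoder = SemanticSpatialPromptEncoder(dim=256)
    image_feats = torch.randn(2, 64 * 64, 256)  # e.g. a flattened SAM image embedding
    text_feats = torch.randn(2, 20, 256)        # e.g. projected word embeddings from a text encoder
    print(encoder(image_feats, text_feats).shape)  # torch.Size([2, 4, 256])
```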
SSP-SAM's design is parameter-efficient: it builds on a frozen SAM backbone, so the massive base model never needs full retraining. This preserves SAM's general-purpose segmentation capabilities while extending them into the linguistic domain. The framework also handles the more challenging Generalized RES (GRES) setting, where a query may refer to zero, one, or multiple objects, without any special modifications. On standard RES and GRES benchmarks it outperforms prior methods, with strong results even at strict precision thresholds, and evaluation on the PhraseCut dataset shows improved open-vocabulary capability over existing state-of-the-art approaches. The code and model checkpoints are publicly available for immediate experimentation by the developer community.
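The frozen-backbone recipe can be illustrated with a short, hypothetical training setup: load SAM via Meta's segment_anything package, freeze all of its parameters, and optimize only the SSP encoder so that its output tokens stand in for SAM's usual point and box prompt embeddings. The wiring into the mask decoder is sketched in comments and is an assumption about how such a prompt could be fed in, not the authors' released code.

```python
# Hypothetical parameter-efficient training setup: SAM is loaded and frozen,
# and only the SSP encoder receives gradients. Model loading uses Meta's
# `segment_anything` package; the checkpoint path is a placeholder.
import torch
from segment_anything import sam_model_registry

sam = sam_model_registry["vit_b"](checkpoint="path/to/sam_vit_b.pth")
sam.eval()
for param in sam.parameters():
    param.requires_grad_(False)        # SAM stays frozen end to end

ssp_encoder = SemanticSpatialPromptEncoder(dim=256)   # sketch from above
optimizer = torch.optim.AdamW(ssp_encoder.parameters(), lr=1e-4)

# Illustrative forward pass (assumed wiring, kept as comments): the SSP tokens
# replace SAM's usual sparse point/box prompt embeddings, so a referring
# expression alone drives the predicted mask.
# image_embedding = sam.image_encoder(pixel_values)             # (B, 256, 64, 64)
# prompts = ssp_encoder(image_embedding.flatten(2).transpose(1, 2), text_feats)
# _, dense_embeddings = sam.prompt_encoder(points=None, boxes=None, masks=None)
# low_res_masks, scores = sam.mask_decoder(
#     image_embeddings=image_embedding,
#     image_pe=sam.prompt_encoder.get_dense_pe(),
#     sparse_prompt_embeddings=prompts,
#     dense_prompt_embeddings=dense_embeddings,
#     multimask_output=False,
# )
```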
- Adds a Semantic-Spatial Prompt encoder to Meta's SAM, enabling precise object segmentation via natural language descriptions.
- Uses visual and linguistic "attention adapters" to highlight key image regions and text phrases, creating high-quality prompts for SAM.
- Achieves strong benchmark performance, excels in open-vocabulary tasks on PhraseCut, and naturally supports complex multi-object queries.
Why It Matters
Moves AI from generic image segmentation to understanding and acting on specific language commands, a key step for more intuitive human-computer interaction.