UniSurgSAM: A Unified Promptable Model for Reliable Surgical Video Segmentation
A new model tackles hallucinations and mask drift in surgical video, letting surgeons specify targets by click, voice, or text.
A research team led by Haofeng Liu has introduced UniSurgSAM, a unified Promptable Video Object Segmentation (PVOS) model designed to make computer-assisted surgery markedly more reliable. The model directly addresses critical flaws in existing methods, which are typically limited to a single prompt modality and suffer from optimization interference, hallucinations, and mask drift. UniSurgSAM's central advance is letting surgeons dynamically specify anatomical targets or instruments using heterogeneous cues, whether by clicking on the video, speaking a command, or typing a text description, all within a single, cohesive system.
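To make the heterogeneous-cue idea concrete, the sketch below shows one plausible way a system could normalize clicks, typed text, and spoken commands into a single prompt structure before they reach the model. This is a minimal illustration under assumed names (PromptModality, SurgicalPrompt, and every field are hypothetical), not UniSurgSAM's actual interface.

```python
# Illustrative sketch only: names and fields are assumptions,
# not UniSurgSAM's published API.
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional, Tuple

import numpy as np


class PromptModality(Enum):
    POINT = auto()  # a click on a video frame
    TEXT = auto()   # a typed description, e.g. "the left grasper"
    AUDIO = auto()  # a spoken command, passed as raw samples


@dataclass
class SurgicalPrompt:
    """One prompt in any modality, normalized into a single structure
    so the rest of the pipeline never branches on the input type."""
    modality: PromptModality
    frame_index: int                          # frame the prompt refers to
    point: Optional[Tuple[int, int]] = None   # (x, y) pixel, for POINT
    text: Optional[str] = None                # for TEXT
    waveform: Optional[np.ndarray] = None     # mono audio samples, for AUDIO


# A surgeon could then issue any of these three equivalently:
click = SurgicalPrompt(PromptModality.POINT, frame_index=0, point=(412, 285))
typed = SurgicalPrompt(PromptModality.TEXT, frame_index=0, text="the needle driver")
spoken = SurgicalPrompt(PromptModality.AUDIO, frame_index=0,
                        waveform=np.zeros(16000, dtype=np.float32))
```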
At its core, UniSurgSAM employs a novel decoupled two-stage framework that independently optimizes target initialization and tracking, resolving the interference that plagues coupled models. For reliability, it integrates three key technical designs: presence-aware decoding to model target absence and suppress false positives; boundary-aware long-term tracking to prevent mask drift over extended surgical sequences; and an adaptive state transition mechanism that closes the loop between stages for automatic failure recovery. The team also established a comprehensive multi-modal benchmark from four public surgical datasets to rigorously test the model.
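One way to picture the decoupled control flow and its adaptive state transition is as a small two-state machine: initialization runs until the target is confidently located, tracking runs while the presence score stays high, and a low score hands control back to initialization for automatic recovery. The following sketch is an assumed reading of that loop; the stage names, the 0.5 threshold, and the initializer/tracker interfaces are all hypothetical.

```python
# Hypothetical control loop illustrating the decoupled two-stage design
# with presence-aware gating and automatic failure recovery. All names,
# thresholds, and the initializer/tracker interfaces are assumptions.
from enum import Enum, auto


class Stage(Enum):
    INITIALIZE = auto()  # locate the prompted target from scratch
    TRACK = auto()       # propagate the mask frame to frame


PRESENCE_THRESHOLD = 0.5  # assumed cutoff on the presence score


def segment_video(frames, prompt, initializer, tracker):
    """Run the two stages, switching back to initialization whenever the
    presence score says the target is absent or tracking has failed."""
    stage = Stage.INITIALIZE
    masks = []
    for frame in frames:
        if stage is Stage.INITIALIZE:
            mask, presence = initializer(frame, prompt)
            if presence >= PRESENCE_THRESHOLD:
                stage = Stage.TRACK  # target found: hand off to the tracker
            else:
                mask = None  # target not yet visible; emit no mask
        else:
            mask, presence = tracker(frame)
            if presence < PRESENCE_THRESHOLD:
                # Presence-aware decoding flags the target as missing
                # (occluded or out of view): emit no mask instead of a
                # hallucinated one, and fall back to initialization.
                mask = None
                stage = Stage.INITIALIZE
        masks.append(mask)
    return masks
```

Keeping the two stages behind separate callables mirrors the decoupling described above: each stage can be optimized against its own objective without interfering with the other.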
Extensive experiments demonstrate that UniSurgSAM achieves state-of-the-art performance across all prompt modalities and granularities while operating in real time. Validated on the team's new multi-modal benchmark, these results position the model as a practical and robust foundation for next-generation surgical AI assistants. The work extends a paper accepted at MICCAI 2025, and the code and datasets will be made publicly available to accelerate research in the field.
- UniSurgSAM accepts visual, textual, or audio prompts in a single model, a first for surgical video segmentation.
- Its decoupled two-stage framework, presence-aware decoding, and boundary-aware tracking suppress hallucinations and mask drift, key failure points in existing methods.
- The model achieves state-of-the-art performance in real-time on a new multi-modal benchmark built from four public surgical datasets.
Why It Matters
This brings reliable, multi-modal AI assistance into the operating room, allowing surgeons to control systems naturally during complex procedures.