SurgFusion-Net: Diversified Adaptive Multimodal Fusion Network for Surgical Skill Assessment
New multimodal AI network fuses RGB video, optical flow, and tool masks to evaluate surgeons in real operations.
A large international research team has published a paper on arXiv introducing SurgFusion-Net, a Diversified Adaptive Multimodal Fusion Network designed to automate surgical skill assessment in real clinical environments. The work addresses a critical gap in surgical AI: most existing systems are trained and tested only in controlled "dry-lab" simulations, creating a significant domain gap when they are applied to the complexities of live surgery, with its tissue motion, camera movement, and varied lighting. To bridge this gap, the researchers also contribute two first-of-their-kind clinical datasets, RAH-skill (37 robot-assisted hysterectomy videos) and RARP-skill (33 robot-assisted radical prostatectomy videos), totaling over 350,000 annotated RGB frames with corresponding optical flow and tool segmentation masks, all labeled with M-GEARS skill metrics.
The core technical innovation is the Divergence Regulated Attention (DRA) module, a novel fusion strategy that combines information from the three input modalities (RGB, optical flow, tool masks) according to surgical context. DRA uses adaptive dual attention together with diversity-promoting multi-head attention to determine which data streams are most relevant for skill evaluation at any given moment. In validation, SurgFusion-Net showed consistent improvements over recent baselines, with Spearman's correlation coefficient (SCC) gains of 0.02-0.04 on the established JIGSAWS benchmark and larger gains of 0.0538 and 0.0493 on the new RAH-skill and RARP-skill datasets, respectively. These results suggest that performance transfers from simulation to real operating rooms, paving the way for objective analytics in surgical training and quality assurance.
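To make the fusion idea concrete, here is a minimal sketch of attention-gated modality fusion: per-modality relevance scores are passed through a softmax, and the resulting weights blend the three feature streams. This is an illustrative simplification, not the paper's actual DRA module; all function and variable names below are invented for the example.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def fuse_modalities(features, relevance_scores):
    """Blend per-modality feature vectors with softmax attention weights.

    features: dict of modality name -> feature vector (list of floats)
    relevance_scores: dict of modality name -> scalar relevance score
    Returns (fused vector, dict of attention weights per modality).
    """
    names = sorted(features)
    weights = softmax([relevance_scores[n] for n in names])
    dim = len(features[names[0]])
    fused = [0.0] * dim
    for w, n in zip(weights, names):
        for i, v in enumerate(features[n]):
            fused[i] += w * v
    return fused, dict(zip(names, weights))

# Toy one-hot features so each output dimension traces back to one modality.
feats = {
    "rgb":  [1.0, 0.0, 0.0],
    "flow": [0.0, 1.0, 0.0],
    "mask": [0.0, 0.0, 1.0],
}
scores = {"rgb": 2.0, "flow": 0.0, "mask": 0.0}  # RGB judged most relevant
fused, weights = fuse_modalities(feats, scores)
```

In the real network the relevance scores would be produced by learned attention layers conditioned on the current frame; here they are fixed constants to show the gating behavior.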
- Introduces two novel clinical datasets: RAH-skill (279,691 frames) and RARP-skill (70,661 frames) with M-GEARS annotations, optical flow, and tool masks.
- Features a new Divergence Regulated Attention (DRA) fusion module that adaptively combines RGB, optical flow, and tool mask data.
- Outperforms prior methods with SCC improvements up to 0.0538, demonstrating effective skill assessment in real surgical settings, not just simulations.
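The SCC figures above are Spearman rank correlations between predicted and expert-assigned skill scores. A minimal self-contained computation of that metric, with illustrative scores that are not taken from the datasets:

```python
import math

def rank(xs):
    """1-based ranks, averaging ranks over ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the tied positions, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(a, b):
    """Spearman's correlation: Pearson correlation of the rank vectors."""
    ra, rb = rank(a), rank(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / math.sqrt(va * vb)

predicted = [3.1, 4.5, 2.0, 5.0]  # model's skill scores (illustrative)
truth = [3, 5, 2, 4]              # expert M-GEARS ratings (illustrative)
scc = spearman(predicted, truth)  # → 0.8
```

Because SCC depends only on rank order, a model that consistently orders surgeons from novice to expert scores well even if its absolute score scale differs from the raters'.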
Why It Matters
Enables objective, data-driven evaluation of surgeon proficiency, potentially improving training outcomes and patient safety in robotic-assisted surgery.