SurgFusion-Net: Diversified Adaptive Multimodal Fusion Network for Surgical Skill Assessment
New multimodal AI network fuses RGB video, optical flow, and tool masks to evaluate surgeons in real operations.
A large international research team has published a paper on arXiv introducing SurgFusion-Net, a Diversified Adaptive Multimodal Fusion Network designed to automate surgical skill assessment in real clinical environments. The work addresses a critical gap in surgical AI: most existing systems are trained and tested only in controlled "dry-lab" simulations, creating a significant domain gap when they are applied to the complexities of live surgery, with its tissue motion, camera movement, and varied lighting. To bridge this gap, the researchers also contribute two first-of-their-kind clinical datasets, RAH-skill (37 robot-assisted hysterectomy videos) and RARP-skill (33 robot-assisted radical prostatectomy videos), totaling over 350,000 annotated RGB frames with corresponding optical flow and tool segmentation masks, all labeled with M-GEARS skill metrics.
The core technical innovation is the Divergence Regulated Attention (DRA) module, a novel fusion strategy that combines information from the three input modalities (RGB, optical flow, tool masks) according to surgical context. DRA uses adaptive dual attention together with diversity-promoting multi-head attention to determine which data streams are most relevant for skill evaluation at any given moment. In validation, SurgFusion-Net showed consistent improvements over recent baselines, with Spearman's correlation coefficient (SCC) gains of 0.02-0.04 on the established JIGSAWS benchmark and larger gains of 0.0538 and 0.0493 on the new RAH-skill and RARP-skill datasets, respectively. These results suggest that performance transfers from simulation to real operating rooms, paving the way for objective analytics in surgical training and quality assurance.
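To make the fusion idea concrete, here is a minimal sketch of attention-gated modality fusion: per-modality relevance scores are passed through a softmax, and the resulting weights blend the three feature streams. This is an illustrative simplification, not the paper's actual DRA module; all function and variable names below are invented for the example.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def fuse_modalities(features, relevance_scores):
    """Blend per-modality feature vectors with softmax attention weights.

    features: dict of modality name -> feature vector (list of floats)
    relevance_scores: dict of modality name -> scalar relevance score
    Returns (fused vector, dict of attention weights per modality).
    """
    names = sorted(features)
    weights = softmax([relevance_scores[n] for n in names])
    dim = len(features[names[0]])
    fused = [0.0] * dim
    for w, n in zip(weights, names):
        for i, v in enumerate(features[n]):
            fused[i] += w * v
    return fused, dict(zip(names, weights))

# Toy one-hot features so each output dimension traces back to one modality.
feats = {
    "rgb":  [1.0, 0.0, 0.0],
    "flow": [0.0, 1.0, 0.0],
    "mask": [0.0, 0.0, 1.0],
}
scores = {"rgb": 2.0, "flow": 0.0, "mask": 0.0}  # RGB judged most relevant
fused, weights = fuse_modalities(feats, scores)
```

In the real network the relevance scores would be produced by learned attention layers conditioned on the current frame; here they are fixed constants to show the gating behavior.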
- Introduces two novel clinical datasets: RAH-skill (279,691 frames) and RARP-skill (70,661 frames) with M-GEARS annotations, optical flow, and tool masks.
- Features a new Divergence Regulated Attention (DRA) fusion module that adaptively combines RGB, optical flow, and tool mask data.
- Outperforms prior methods with SCC improvements up to 0.0538, demonstrating effective skill assessment in real surgical settings, not just simulations.
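The SCC figures above are Spearman rank correlations between predicted and expert-assigned skill scores. A minimal self-contained computation of that metric, with illustrative scores that are not taken from the datasets:

```python
import math

def rank(xs):
    """1-based ranks, averaging ranks over ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the tied positions, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(a, b):
    """Spearman's correlation: Pearson correlation of the rank vectors."""
    ra, rb = rank(a), rank(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / math.sqrt(va * vb)

predicted = [3.1, 4.5, 2.0, 5.0]  # model's skill scores (illustrative)
truth = [3, 5, 2, 4]              # expert M-GEARS ratings (illustrative)
scc = spearman(predicted, truth)  # → 0.8
```

Because SCC depends only on rank order, a model that consistently orders surgeons from novice to expert scores well even if its absolute score scale differs from the raters'.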
Why It Matters
Enables objective, data-driven evaluation of surgeon proficiency, potentially improving training outcomes and patient safety in robotic-assisted surgery.