Event-Level Detection of Surgical Instrument Handovers in Videos with Interpretable Vision Models
A new AI model uses Vision Transformers and LSTMs to detect instrument exchanges in surgical video, reaching an F1-score of 0.84 for handover detection.
A research team from institutions including Charité – Universitätsmedizin Berlin has developed a novel AI system for automatically detecting and classifying surgical instrument handovers in operating room videos. The model, detailed in a paper submitted to CVPR 2026, tackles the challenging problem of monitoring complex surgical workflows where frequent occlusions and background clutter make manual tracking difficult. It uses a Vision Transformer (ViT) to extract spatial features from video frames and a unidirectional Long Short-Term Memory (LSTM) network to analyze temporal sequences, creating a unified multi-task framework that jointly predicts handover occurrence and direction.
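The paper itself does not ship code, but the described pipeline maps naturally onto a short PyTorch sketch. Everything below is an illustrative assumption rather than the authors' implementation: the `timm` ViT checkpoint, the hidden size, the number of direction classes, and the use of the last LSTM timestep as the clip summary.

```python
# Minimal sketch of a ViT -> LSTM multi-task model for handover detection.
# Hypothetical configuration; the paper's exact architecture may differ.
import torch
import torch.nn as nn
import timm


class HandoverNet(nn.Module):
    def __init__(self, hidden=256, n_directions=4):  # n_directions is assumed
        super().__init__()
        # Frozen ViT backbone as a per-frame spatial feature extractor.
        self.vit = timm.create_model("vit_base_patch16_224",
                                     pretrained=True, num_classes=0)
        for p in self.vit.parameters():
            p.requires_grad = False
        # Unidirectional LSTM models the temporal sequence of frame features.
        self.lstm = nn.LSTM(self.vit.num_features, hidden, batch_first=True)
        # Two task heads share the temporal representation.
        self.occurrence_head = nn.Linear(hidden, 1)            # handover yes/no
        self.direction_head = nn.Linear(hidden, n_directions)  # e.g. surgeon -> assistant

    def forward(self, clip):                  # clip: (B, T, 3, 224, 224)
        b, t = clip.shape[:2]
        feats = self.vit(clip.flatten(0, 1))  # (B*T, D) pooled ViT features
        feats = feats.view(b, t, -1)
        seq, _ = self.lstm(feats)             # (B, T, hidden)
        last = seq[:, -1]                     # summary of the whole clip
        return self.occurrence_head(last), self.direction_head(last)
```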
In experiments on a dataset of kidney transplant procedures, the system demonstrated strong performance, achieving an F1-score of 0.84 for detecting when a handover occurs and a mean F1-score of 0.72 for classifying the direction of the transfer (e.g., surgeon to assistant). This outperformed both a single-task variant and a VideoMamba-based baseline for direction prediction. To address the critical need for trust in medical AI, the researchers employed Layer-CAM (Class Activation Mapping) attribution techniques to visualize which spatial regions of the video—such as hands and instruments—were driving the model's decisions, making its reasoning more interpretable to surgical teams.
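Layer-CAM itself is a published attribution method (Jiang et al., IEEE TIP 2021) and is compact enough to sketch from scratch. How the authors attach it to their ViT is not specified here; the sketch below targets any layer producing a (B, C, H, W) feature map, so applying it to ViT token features would additionally require reshaping tokens into a spatial grid, which is left out as an assumption.

```python
# Generic Layer-CAM attribution: weight each spatial unit of a feature map
# by the ReLU of its own gradient, sum over channels, then ReLU.
import torch
import torch.nn.functional as F


def layer_cam(model, layer, x, class_idx):
    acts, grads = [], []
    h1 = layer.register_forward_hook(lambda m, i, o: acts.append(o))
    h2 = layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
    try:
        score = model(x)[:, class_idx].sum()  # assumes (B, num_classes) output
        model.zero_grad()
        score.backward()
    finally:
        h1.remove(); h2.remove()
    a, g = acts[0].detach(), grads[0].detach()   # (B, C, H, W)
    cam = F.relu((F.relu(g) * a).sum(dim=1))     # (B, H, W), positive evidence only
    cam = cam / (cam.amax(dim=(1, 2), keepdim=True) + 1e-8)
    return cam  # upsample to frame size to overlay on hands/instruments
```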
The research represents a significant step toward automated surgical workflow analysis. By providing reliable, event-level monitoring of instrument exchanges, the system could help reduce procedural errors, standardize training, and create more efficient operating room environments. The team's focus on interpretability through visualization techniques is particularly important for clinical adoption, as it allows surgeons to understand and verify the AI's conclusions rather than treating them as a black box.
- Achieves an F1-score of 0.84 for detecting surgical instrument handovers in video; F1 is more informative than raw accuracy on imbalanced datasets like this one.
- Uses a hybrid Vision Transformer (ViT) and LSTM architecture to jointly model spatial features and temporal sequences in a unified multi-task framework (a sketch of a plausible joint loss follows this list).
- Employs Layer-CAM visualization to highlight hand-instrument interaction cues, making the model's decisions interpretable for clinical validation.
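The exact multi-task objective is not spelled out in this summary. A common formulation, and a plausible reading of "jointly predicts handover occurrence and direction", combines a binary occurrence loss with a direction loss applied only to clips that contain a handover; the masking and the weighting factor `w` below are assumptions.

```python
import torch
import torch.nn.functional as F


def multitask_loss(occ_logit, dir_logits, occ_target, dir_target, w=1.0):
    # Binary cross-entropy for "does a handover occur in this clip?"
    occ_loss = F.binary_cross_entropy_with_logits(
        occ_logit.squeeze(-1), occ_target.float())
    # Direction loss only on clips that actually contain a handover
    # (this masking is our assumption, not necessarily the authors' choice).
    mask = occ_target.bool()
    dir_loss = (F.cross_entropy(dir_logits[mask], dir_target[mask])
                if mask.any() else occ_logit.new_zeros(()))
    return occ_loss + w * dir_loss
```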
Why It Matters
Automates surgical workflow analysis to improve operating room efficiency, reduce errors, and provide data for training and standardization.