AffordTissue: Dense Affordance Prediction for Tool-Action Specific Tissue Interaction
New surgical AI framework predicts exactly where tools should interact with tissue, cutting prediction error by 66%.
A research team from Johns Hopkins University and Shanghai Jiao Tong University has introduced AffordTissue, a breakthrough AI framework designed to bring surgical automation closer to clinical reality. The system addresses a critical gap in current surgical AI: existing models can mimic dexterous control, but they cannot reliably predict where instruments should interact with tissue surfaces. AffordTissue closes this gap by generating dense heatmaps that predict tool-action specific affordance regions, in effect showing surgeons and robotic systems exactly where on tissue each surgical action should occur.
The framework combines three key components: a temporal vision encoder that captures tool motion and tissue dynamics across multiple viewpoints, language conditioning that enables generalization across diverse instrument-action pairs, and a DiT-style decoder that generates precise affordance predictions. The researchers established the first tissue affordance benchmark by curating and annotating 15,638 video clips from 103 cholecystectomy procedures, covering six unique tool-action pairs built from four instruments (hook, grasper, scissors, clipper) and their associated actions, including dissection, grasping, clipping, and cutting.
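To make the pipeline concrete, here is a minimal PyTorch sketch of such an architecture. Every module size, the token-to-heatmap scheme, and the simple linear head standing in for the paper's DiT-style decoder are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class AffordanceHeatmapSketch(nn.Module):
    """Illustrative AffordTissue-style model: temporal vision encoding,
    language conditioning via cross-attention, and a dense heatmap head.
    All sizes are placeholders; a plain linear upsampling head stands in
    for the paper's DiT-style decoder."""

    def __init__(self, dim=256, vocab_size=32000):
        super().__init__()
        # Per-frame patch embedding (stand-in for a pretrained ViT backbone).
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        # Temporal encoder mixes patch tokens across the frame window.
        self.temporal = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Language conditioning: embed the instruction, attend from vision tokens.
        self.text_embed = nn.Embedding(vocab_size, dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        # Heatmap head: one 16x16 patch of logits per vision token.
        self.head = nn.Linear(dim, 16 * 16)

    def forward(self, frames, text_ids):
        # frames: (B, T, 3, H, W) with H, W divisible by 16
        # text_ids: (B, L) token ids for an instruction like "hook dissect"
        B, T, _, H, W = frames.shape
        x = self.patch_embed(frames.flatten(0, 1))     # (B*T, dim, H/16, W/16)
        hp, wp = x.shape[-2:]
        x = x.flatten(2).transpose(1, 2)               # (B*T, patches, dim)
        x = x.reshape(B, T * hp * wp, -1)              # concat tokens over time
        x = self.temporal(x)                           # temporal mixing
        txt = self.text_embed(text_ids)                # (B, L, dim)
        x, _ = self.cross_attn(x, txt, txt)            # condition on language
        x = x.reshape(B, T, hp * wp, -1).mean(dim=1)   # pool over frames
        logits = self.head(x)                          # (B, patches, 256)
        heatmap = logits.reshape(B, hp, wp, 16, 16)
        heatmap = heatmap.permute(0, 1, 3, 2, 4).reshape(B, H, W)
        return torch.sigmoid(heatmap)                  # dense affordance map
```

Cross-attention from pooled spatio-temporal vision tokens onto the instruction embedding is one common way to inject language conditioning; the paper's exact mechanism may differ.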
Experiments demonstrate substantial improvements over existing approaches, with AffordTissue achieving an Average Symmetric Surface Distance (ASSD) of 20.6 pixels compared to 60.2 pixels for the Molmo-VLM baseline, a 66% reduction in prediction error. The result shows that task-specific architectures can outperform large-scale foundation models for dense surgical affordance prediction. The system's explicit spatial reasoning provides clear guidance for safe surgical automation, potentially enabling early safe-stop mechanisms when instruments deviate outside predicted safe zones.
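For readers unfamiliar with the metric, ASSD averages the distances between the boundaries of the predicted and ground-truth regions in both directions. Below is a minimal NumPy/SciPy sketch of the standard formulation, not the authors' evaluation code, and it assumes the heatmap has already been thresholded into a binary mask:

```python
import numpy as np
from scipy.ndimage import binary_erosion, distance_transform_edt

def assd(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """Average Symmetric Surface Distance in pixels between two binary masks.
    Assumes both masks are non-empty."""
    def surface(mask):
        # Boundary pixels: the mask minus its morphological erosion.
        mask = mask.astype(bool)
        return mask & ~binary_erosion(mask)

    sp, sg = surface(pred_mask), surface(gt_mask)
    # distance_transform_edt gives each pixel's distance to the nearest
    # boundary pixel of the *other* mask; index with our own boundary.
    d_pred_to_gt = distance_transform_edt(~sg)[sp]
    d_gt_to_pred = distance_transform_edt(~sp)[sg]
    return (d_pred_to_gt.sum() + d_gt_to_pred.sum()) / (
        len(d_pred_to_gt) + len(d_gt_to_pred))
```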
- Predicts surgical tool interaction zones with 20.6px average surface error, roughly a third of Molmo-VLM's 60.2px
- Trained on 15,638 annotated video clips from 103 gallbladder removal procedures
- Combines temporal vision encoding, language conditioning, and DiT-style decoding for precise heatmaps
Why It Matters
Enables safer surgical automation by predicting exact tissue interaction zones, potentially reducing errors in robotic-assisted procedures.
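As an illustration of how such a safe-stop could be wired up, the hypothetical check below halts motion when a tracked tool tip leaves the predicted region; the 0.5 threshold and the external tool-tip tracker are assumptions, not components described in the paper:

```python
import numpy as np

def should_safe_stop(heatmap: np.ndarray, tool_tip_xy, threshold: float = 0.5) -> bool:
    """Hypothetical early safe-stop: True if the tracked tool tip lies
    outside the predicted affordance region.
    heatmap: (H, W) affordance probabilities from the model
    tool_tip_xy: (x, y) pixel coordinates from an external tool tracker"""
    x, y = int(tool_tip_xy[0]), int(tool_tip_xy[1])
    in_bounds = 0 <= y < heatmap.shape[0] and 0 <= x < heatmap.shape[1]
    # Binarize the heatmap at an assumed probability threshold.
    return not (in_bounds and heatmap[y, x] >= threshold)
```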