Hi-GaTA generates surgical reports from video with 40K minutes of pretraining
New AI model turns surgical videos into automated clinician-grade reports
Surgical video report generation has long been difficult for two reasons: dense spatio-temporal video data is hard to align with language reasoning, and high-quality, privacy-preserving datasets are scarce. To address this, researchers propose Hi-GaTA (Hierarchical Gated Temporal Aggregation Adapter), a Perception-Alignment-Reasoning framework that efficiently compresses long video sequences into compact, LLM-compatible visual prefix tokens through short-to-long-range temporal aggregation. They also release a benchmark of 214 high-quality simulated surgical videos paired with surgeon-authored reports, and they pretrain Sur40k, a surgery-specific ViViT-style video encoder, on 40,000 minutes of public surgical videos to capture fine-grained spatio-temporal procedural priors.
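To make the idea of short-to-long-range temporal aggregation concrete, here is a minimal sketch of a temporal pyramid that turns a long sequence of per-frame features into a much shorter list of prefix tokens. It is an illustration under stated assumptions, not the authors' implementation: plain mean pooling stands in for Hi-GaTA's learned, text-conditioned aggregation, and the function names and window sizes (2, 4, 8) are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def mean_pool(frames: List[Vector], window: int) -> List[Vector]:
    """Average consecutive groups of `window` frame features.

    The final group may be shorter if len(frames) is not a multiple of `window`.
    """
    pooled = []
    for start in range(0, len(frames), window):
        group = frames[start:start + window]
        dim = len(group[0])
        pooled.append([sum(f[d] for f in group) / len(group) for d in range(dim)])
    return pooled


def temporal_pyramid(frames: List[Vector],
                     windows: Sequence[int] = (2, 4, 8)) -> List[List[Vector]]:
    """Build one pooled sequence per window size, short-range first.

    Assumption: fixed mean pooling replaces the learned aggregation in the paper.
    """
    return [mean_pool(frames, w) for w in windows]


def to_prefix_tokens(pyramid: List[List[Vector]]) -> List[Vector]:
    """Flatten all pyramid levels into a single list of prefix tokens."""
    return [token for level in pyramid for token in level]


# Eight frames with 2-dimensional features: [frame index, constant 1.0].
frames = [[float(t), 1.0] for t in range(8)]
pyramid = temporal_pyramid(frames)
prefix = to_prefix_tokens(pyramid)
print(len(frames), "frames ->", len(prefix), "prefix tokens")
```

The coarsest level summarizes the whole clip in a single token, while the finest level keeps local detail, which is the basic trade-off a temporal pyramid is meant to cover.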
Hi-GaTA employs a temporal pyramid with text-conditioned dual cross-attention and improves multi-scale consistency through cross-level gated fusion and an increasing-depth strategy. The LLM backbone is fine-tuned with LoRA so that it generates coherent, stylistically consistent reports under limited supervision. In the reported experiments, the approach achieves the best overall performance against strong multimodal LLM baselines, and ablation studies indicate that each component contributes. The work is a step toward automated, objective surgical assessment and could reduce documentation time and improve surgical training and feedback loops.
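Two of the ingredients named above can be sketched in a few lines. The first function shows one common form of gated fusion, where a sigmoid gate blends a fine-scale and a coarse-scale feature vector. The exact gate used in Hi-GaTA is not given here, so the scalar gate and its parameters (`gate_w`, `gate_b`) are assumptions. The second function shows the standard LoRA update, in which a frozen weight W is adjusted by a low-rank product scaled by alpha / r. All names are hypothetical.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(fine: Vector, coarse: Vector,
                 gate_w: Vector, gate_b: float) -> Vector:
    """Blend two same-sized feature vectors with a scalar gate g in (0, 1).

    g     = sigmoid(gate_w . [fine; coarse] + gate_b)
    fused = g * fine + (1 - g) * coarse
    Assumption: a single scalar gate; the paper's gate may be per-channel.
    """
    concat = fine + coarse  # list concatenation, length 2 * dim
    g = sigmoid(sum(w * x for w, x in zip(gate_w, concat)) + gate_b)
    return [g * f + (1.0 - g) * c for f, c in zip(fine, coarse)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x k) and b (k x m)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def lora_weight(w: Matrix, b: Matrix, a: Matrix, alpha: float) -> Matrix:
    """Effective weight W + (alpha / r) * B @ A, where r is the adapter rank.

    Shapes: w is (out x in), b is (out x r), a is (r x in).
    """
    r = len(a)
    delta = matmul(b, a)
    scale = alpha / r
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]


# With zero gate weights and bias the gate is exactly 0.5: an even blend.
fused = gated_fusion([1.0, 2.0], [3.0, 4.0], [0.0, 0.0, 0.0, 0.0], 0.0)

# A rank-1 adapter on a 2x2 identity weight.
identity = [[1.0, 0.0], [0.0, 1.0]]
adapted = lora_weight(identity, [[1.0], [2.0]], [[1.0, 1.0]], alpha=2.0)
print(fused, adapted)
```

In LoRA, B is conventionally initialized to zero, so the adapted weight starts out identical to the pretrained one and only the small A and B matrices are trained. That is what makes fine-tuning feasible with limited supervision.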
- Hi-GaTA compresses long surgical videos into compact visual prefix tokens via hierarchical temporal aggregation
- Sur40k encoder pretrained on 40,000 minutes of public surgical data captures fine-grained spatio-temporal priors
- Outperforms strong multimodal LLM baselines on a benchmark of 214 simulated surgical videos with surgeon-authored reports
Why It Matters
Automating surgical report generation could cut documentation time and give surgeons more objective feedback.