Research & Papers

Multi-Modal Monocular Endoscopic Depth and Pose Estimation with Edge-Guided Self-Supervision

New AI model uses edge detection and shading analysis to map colonoscopy videos in 3D without labeled data.

Deep Dive

A research team from University College London and the Wellcome/EPSRC Centre for Interventional and Surgical Sciences has introduced PRISM (Pose-Refinement with Intrinsic Shading and edge Maps), a novel self-supervised learning framework for 3D reconstruction from monocular colonoscopy videos. The system addresses a critical challenge in medical AI: creating accurate depth and pose estimation models without requiring expensive, hard-to-obtain labeled in-vivo datasets.

The technical approach is multi-modal, combining two key innovations. First, it uses learning-based edge detectors (such as DexiNed or HED) trained to capture thin, high-frequency anatomical boundaries, providing structural guidance. Second, it employs an intrinsic decomposition module to separate shading from reflectance in images, allowing the model to exploit shading cues—a reliable signal in the complex, deformable colon environment—for depth estimation. This edge-guided, shading-aware method enables the model to learn directly from unlabeled real-world colonoscopy videos.
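The two ingredients above can be sketched in a few lines. This is a simplified illustration, not PRISM's actual implementation: the Lambertian split I = R · S stands in for the learned intrinsic decomposition module, and the `1 + alpha * edges` weighting is a hypothetical way an edge map from a detector like DexiNed or HED could steer a photometric reconstruction loss toward anatomical boundaries.

```python
import numpy as np

def intrinsic_decompose(image, shading):
    """Split an image I into reflectance R and shading S under the
    Lambertian assumption I = R * S (element-wise). In PRISM the
    shading map comes from a learned decomposition module; here it
    is simply given as an input for illustration."""
    eps = 1e-6  # avoid division by zero in dark regions
    reflectance = image / (shading + eps)
    return reflectance

def edge_weighted_photometric_loss(target, reconstructed, edge_map, alpha=2.0):
    """Photometric reconstruction error up-weighted at edges.
    `edge_map` in [0, 1] would come from a learned edge detector;
    the (1 + alpha * edges) weighting is an illustrative choice,
    not the paper's exact formulation."""
    per_pixel = np.abs(target - reconstructed)
    weights = 1.0 + alpha * edge_map
    return float(np.mean(weights * per_pixel))
```

In a full self-supervised pipeline, `reconstructed` would be the current frame re-synthesized by warping a neighboring frame through the predicted depth and camera pose, so minimizing this loss trains both networks without ground-truth labels.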

The research yielded two significant practical insights from extensive ablation studies. First, self-supervised training on real clinical data consistently outperformed supervised training on realistic synthetic phantom data, demonstrating that domain realism is more valuable than perfect ground truth labels. Second, the team identified video frame rate as an extremely important factor; optimal, dataset-specific frame sampling is necessary to generate high-quality training sequences for learning coherent motion and geometry. These findings establish new best practices for developing surgical navigation AI. The work has been early-accepted for presentation at IPCAI 2026, a premier conference in computer-assisted intervention.
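The frame-rate finding boils down to choosing a temporal stride so that consecutive training frames show enough camera motion. The paper only reports that the effective rate must be tuned per dataset; the simple stride rule below is an illustrative baseline, and the function name and parameters are hypothetical.

```python
def subsample_frames(num_frames, source_fps, target_fps):
    """Pick frame indices from a video so consecutive training frames
    exhibit meaningful parallax for learning motion and geometry.
    `target_fps` is a dataset-specific hyperparameter; this uniform
    stride is a baseline sketch, not the paper's sampling scheme."""
    stride = max(1, round(source_fps / target_fps))
    return list(range(0, num_frames, stride))
```

For example, downsampling a 30 fps colonoscopy clip to an effective 10 fps keeps every third frame; a slower-moving dataset might warrant a larger stride to avoid near-duplicate training pairs.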

Key Points
  • PRISM uses edge detection and intrinsic shading–reflectance decomposition for self-supervised 3D mapping from 2D colonoscopy videos.
  • Ablation studies show training on real clinical data beats synthetic data, prioritizing domain realism over perfect labels.
  • Video frame rate is identified as a critical factor, requiring dataset-specific sampling for optimal training quality.

Why It Matters

Enables more accurate AI-assisted colonoscopy navigation, potentially reducing missed lesions and incomplete exams by improving 3D spatial awareness.