Image & Video

Colon-Bench: An Agentic Workflow for Scalable Dense Lesion Annotation in Full-Procedure Colonoscopy Videos

Researchers use an AI agent workflow to annotate 528 full-length colonoscopy videos, creating a benchmark 10x larger than existing datasets.

Deep Dive

A research team led by Abdullah Hamdi has developed Colon-Bench, a groundbreaking dataset and agentic workflow designed to overcome a critical bottleneck in medical AI development: the lack of densely annotated, long-form video data. Where existing colonoscopy datasets typically cover only polyp detection, Colon-Bench employs a multi-stage pipeline in which AI agents propose temporal segments, track lesions with bounding boxes, and generate visual confirmations, all verified by human experts. The result is an unprecedented resource: 528 full-length procedure videos annotated with over 300,000 bounding boxes, 213,000 segmentation masks, and 133,000 words of clinical text across 14 distinct lesion types, including ulcers and bleeding.
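
To make the propose-track-verify structure concrete, here is a minimal Python sketch of such an agentic annotation loop. Everything in it is illustrative: the class names, the stand-in proposal and tracking logic, and the `expert_ok` callback are assumptions made for exposition, not the team's actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Box:
    frame: int
    xywh: tuple[float, float, float, float]   # normalized box coordinates

@dataclass
class Segment:
    start: int
    end: int
    lesion_type: str                 # one of the 14 categories, e.g. "ulcer"
    boxes: list[Box] = field(default_factory=list)
    verified: bool = False

def propose_segments(num_frames: int) -> list[Segment]:
    # Stage 1 stand-in: a proposal agent would scan the full video and
    # flag temporal windows likely to contain a lesion. A single fake
    # proposal keeps the loop below runnable.
    return [Segment(start=120, end=min(300, num_frames), lesion_type="ulcer")]

def track_lesion(seg: Segment) -> Segment:
    # Stage 2 stand-in: a tracker would propagate a box through every
    # frame of the segment; here each frame just gets a fixed box.
    seg.boxes = [Box(f, (0.4, 0.4, 0.2, 0.2)) for f in range(seg.start, seg.end)]
    return seg

def annotate(num_frames: int, expert_ok) -> list[Segment]:
    # Stage 3: each tracked segment is rendered as a visual confirmation
    # and gated by a human expert before it enters the dataset.
    accepted = []
    for seg in propose_segments(num_frames):
        seg = track_lesion(seg)
        if expert_ok(seg):           # human-in-the-loop verification
            seg.verified = True
            accepted.append(seg)
    return accepted

if __name__ == "__main__":
    kept = annotate(num_frames=50_000, expert_ok=lambda s: True)
    print(f"{len(kept)} segment(s), {sum(len(s.boxes) for s in kept)} boxes")
```

The key design point this illustrates is that the expensive human effort is moved to the end of the loop: experts only accept or reject agent-generated proposals rather than drawing boxes frame by frame, which is what makes dense annotation of full-length procedures tractable.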

The team used this new benchmark to rigorously evaluate state-of-the-art Multimodal Large Language Models (MLLMs) like GPT-4V and Claude 3 on medical vision tasks. They found that MLLMs demonstrated surprisingly strong localization capabilities in this domain, even outperforming specialized models like Meta's Segment Anything Model 3 (SAM-3) in some contexts. By analyzing common errors in video-based Visual Question Answering (VQA), the researchers developed a novel 'colon-skill' prompting strategy that boosted zero-shot MLLM performance by up to 9.7%. This work provides both a vital new training resource and a methodology for scaling high-quality annotation in complex medical video, accelerating the development of robust AI assistants for gastroenterologists.
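
As a rough illustration of what a skill-prefix prompting strategy could look like, the snippet below assembles domain guidance ahead of a zero-shot VQA question. The skill text and the `build_colon_skill_prompt` helper are hypothetical stand-ins; the paper's actual 'colon-skill' prompt is not reproduced here.

```python
# Hypothetical skill hints; the actual 'colon-skill' prompt text from the
# paper is not shown here.
COLON_SKILLS = [
    "Describe lesion appearance (color, border, surface) before naming it.",
    "Distinguish lesions from artifacts such as glare, bubbles, and debris.",
    "State the temporal extent: which frames of the clip show the finding.",
]

def build_colon_skill_prompt(question: str) -> str:
    """Prepend domain-skill guidance to a zero-shot VQA question."""
    skills = "\n".join(f"- {s}" for s in COLON_SKILLS)
    return (
        "You are analyzing a colonoscopy video clip. "
        "Apply these skills before answering:\n"
        f"{skills}\n\n"
        f"Question: {question}\nAnswer:"
    )

print(build_colon_skill_prompt("Is an ulcer visible in this segment?"))
```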

Key Points
  • Uses an AI agent workflow to annotate 528 full-length colonoscopy videos at scale, yielding a dataset 10x larger than prior benchmarks.
  • Contains over 300,000 bounding boxes and 213,000 segmentation masks across 14 lesion categories, enabling rigorous MLLM evaluation.
  • Introduces a 'colon-skill' prompting strategy that improves zero-shot MLLM performance on medical VQA tasks by up to 9.7%.

Why It Matters

Provides the large-scale, annotated data needed to train accurate AI diagnostic assistants, potentially improving early detection of colorectal cancer.