
CTSCAN: Evaluation Leakage in Chest CT Segmentation and a Reproducible Patient-Disjoint Benchmark

arXiv eess.IV April 20, 2026

⚡ New research finds that a chest CT segmentation model lost 69% of its foreground Dice score once training and test patients were kept separate.

Deep Dive

A new benchmark called CTSCAN, created by researcher Anton Ivchenko, exposes a flaw in how chest CT segmentation models are commonly evaluated. When CT slices from the same patient end up in both the training and the test set, the model is scored partly on anatomy it has already seen, and the reported metrics are inflated. In controlled experiments with an FPN using an EfficientNet-B0 encoder, the model reached a foreground Dice of 0.6665 under this "slice-mixed" protocol but only 0.2066 under "patient-disjoint" splits, where every patient appears in exactly one partition. That is a 69% relative drop ((0.6665 - 0.2066) / 0.6665 ≈ 0.69), which suggests that most of the slice-mixed score came from data leakage rather than from generalization to new patients.
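The difference between the two protocols can be sketched in a few lines of Python. This is an illustration only, not CTSCAN's code: the function names, the `(patient_id, slice_index)` tuples, and the toy cohort of 10 patients are all assumptions made for the example.

```python
import random
from collections import defaultdict


def slice_mixed_split(slices, test_frac=0.2, seed=0):
    """Shuffle individual slices, so one patient can land on both sides."""
    rng = random.Random(seed)
    shuffled = list(slices)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_frac))
    return shuffled[n_test:], shuffled[:n_test]


def patient_disjoint_split(slices, test_frac=0.2, seed=0):
    """Shuffle patients, then send every slice of a patient to one side only."""
    by_patient = defaultdict(list)
    for s in slices:
        by_patient[s[0]].append(s)
    patients = sorted(by_patient)  # sort first so the shuffle is reproducible
    rng = random.Random(seed)
    rng.shuffle(patients)
    n_test = max(1, int(round(len(patients) * test_frac)))
    test_ids = set(patients[:n_test])
    train = [s for p in patients if p not in test_ids for s in by_patient[p]]
    test = [s for p in patients if p in test_ids for s in by_patient[p]]
    return train, test


def shared_patients(train, test):
    """Patients that appear in both partitions (should be empty if leak-free)."""
    return {s[0] for s in train} & {s[0] for s in test}


# Toy data (hypothetical): 10 patients with 20 slices each.
slices = [(f"patient_{p:02d}", i) for p in range(10) for i in range(20)]
mixed_train, mixed_test = slice_mixed_split(slices)
clean_train, clean_test = patient_disjoint_split(slices)

print("patients shared, slice-mixed:", len(shared_patients(mixed_train, mixed_test)))
print("patients shared, patient-disjoint:", len(shared_patients(clean_train, clean_test)))
```

A check like `shared_patients` is cheap to run before any training job and makes leakage visible as a non-empty set rather than as an unexpectedly high score.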

CTSCAN addresses this problem by providing a reproducible research stack that aggregates 89 cases from three public datasets (PleThora, MedSeg SIRM, and LongCIU) with corrected patient-disjoint partitions. The benchmark includes deterministic split manifests, explicit weak-supervision controls, and scripted multi-seed protocol sweeps, so that evaluations are both rigorous and repeatable. Because the package covers the full pipeline from data to figure generation, it gives researchers a standardized foundation for developing segmentation models that generalize to new patients rather than memorizing the ones they were trained on.
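One common way to make a split manifest deterministic is to derive each patient's partition from a hash of the patient ID, so the assignment does not depend on file order or random state. The sketch below shows that idea under stated assumptions: the summary does not describe CTSCAN's actual manifest format, so the salt, the field names, the `case_000`-style IDs, and the one-case-per-patient simplification are all hypothetical.

```python
import hashlib
import json


def assign_partition(patient_id, salt="ctscan-demo", test_frac=0.2):
    """Map a patient ID to 'train' or 'test' using a salted SHA-256 hash."""
    digest = hashlib.sha256(f"{salt}:{patient_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return "test" if bucket < test_frac else "train"


def build_manifest(patient_ids, salt="ctscan-demo", test_frac=0.2):
    """Build a manifest that is identical regardless of input order."""
    manifest = {"salt": salt, "test_frac": test_frac, "train": [], "test": []}
    for pid in sorted(set(patient_ids)):
        manifest[assign_partition(pid, salt, test_frac)].append(pid)
    return manifest


# Hypothetical IDs: 89 cases, each treated as a distinct patient (an assumption;
# real data may contain several scans per patient, which must share one ID).
patient_ids = [f"case_{i:03d}" for i in range(89)]
manifest = build_manifest(patient_ids)
manifest_json = json.dumps(manifest, indent=2, sort_keys=True)

print(len(manifest["train"]), "train cases,", len(manifest["test"]), "test cases")
```

Hash-based assignment keeps existing patients in place when new cases are added, at the cost of only approximating the requested test fraction. A seeded shuffle written out once to a versioned manifest file is an equally valid alternative.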

Key Points

Removing patient-level leakage cuts foreground Dice by 69% in relative terms (from 0.6665 under slice-mixed splits to 0.2066 under patient-disjoint splits)
CTSCAN benchmark aggregates 89 cases from 3 datasets with corrected patient-disjoint splits
Provides complete reproducible research stack including split manifests, protocol sweeps, and figure generation
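The headline figure is easy to check by hand. The sketch below computes a foreground Dice score on a toy mask and then derives the relative drop from the two reported scores. The `foreground_dice` helper, its smoothing term `eps`, and the flat 0/1 mask representation are assumptions for illustration. The summary does not say how CTSCAN aggregates Dice across slices or cases.

```python
def foreground_dice(pred, target, eps=1e-7):
    """Dice overlap of two binary foreground masks given as flat 0/1 sequences."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return (2.0 * intersection + eps) / (total + eps)


def relative_drop(leaky, clean):
    """Fraction of the leaky score that disappears under the clean protocol."""
    return (leaky - clean) / leaky


# Toy masks: one overlapping pixel, 2 predicted and 1 true foreground pixel.
toy_dice = foreground_dice([1, 1, 0, 0], [1, 0, 0, 0])

# Scores reported in the summary.
slice_mixed, patient_disjoint = 0.6665, 0.2066
drop = relative_drop(slice_mixed, patient_disjoint)
inflation = (slice_mixed - patient_disjoint) / patient_disjoint

print(f"toy Dice: {toy_dice:.3f}")
print(f"relative drop: {drop:.1%}")
print(f"inflation over the clean score: {inflation:.1%}")
```

The direction of the comparison matters: measured against the leaky score the drop is about 69%, while measured against the patient-disjoint score the slice-mixed result is more than three times as high.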

Why It Matters

Patient-disjoint evaluation gives a more realistic estimate of how a segmentation model will perform on patients it has never seen, which lowers the risk of deploying models that fail to generalize.
