New Benchmark Reveals AI Vision Models Can 'Cheat' on 3D Cancer Scans

AI models match or beat their sighted accuracy on lung CT questions with the image removed, showing that private data alone does not guarantee a valid test of visual capability.

Deep Dive

A team led by Bo Liu proposes an automated, agent-driven pipeline that generates multiple-choice visual question answering (VQA) datasets directly from private radiology reports paired with 3D oncology imaging. The pipeline produces two complementary question types: RADS-style questions, derived deterministically from clinician-defined reporting schemas, and report-derived questions, generated by an LLM from radiologist findings and then verified against the source report. Applied to four in-house cancer cohorts (liver, lung, brain, prostate), the resulting benchmark is controlled for instance-level contamination and requires no per-question human annotation.
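To make "derived deterministically" concrete, here is a minimal sketch of how a multiple-choice question could be built from one structured report field. It is not the authors' pipeline: the `SchemaField` shape, the `build_question` helper, the field name, and the category values are all hypothetical, chosen only for illustration.

```python
import hashlib
import random
import string
from dataclasses import dataclass


@dataclass(frozen=True)
class SchemaField:
    """One structured-report field (hypothetical shape, not the paper's schema)."""
    name: str
    question: str
    allowed_values: tuple


def build_question(case_id, field, recorded_value, n_options=4):
    """Build one multiple-choice item from a recorded schema value, with no LLM involved."""
    if recorded_value not in field.allowed_values:
        raise ValueError(f"{recorded_value!r} is not an allowed value for {field.name}")

    # Seed the RNG from the case ID and field name, so the same input
    # always yields the same distractors in the same order.
    digest = hashlib.sha256(f"{case_id}:{field.name}".encode()).hexdigest()
    rng = random.Random(int(digest, 16))

    # Distractors are the other values the schema allows for this field.
    distractors = [v for v in field.allowed_values if v != recorded_value]
    options = rng.sample(distractors, k=min(n_options - 1, len(distractors)))
    options.append(recorded_value)
    rng.shuffle(options)

    letters = string.ascii_uppercase
    return {
        "case_id": case_id,
        "question": field.question,
        "options": {letters[i]: opt for i, opt in enumerate(options)},
        "answer": letters[options.index(recorded_value)],
    }


# Illustrative field and values only; not taken from the paper.
category_field = SchemaField(
    name="category",
    question="Which category was assigned to the dominant lesion?",
    allowed_values=("1", "2", "3", "4", "5"),
)
item = build_question("case-001", category_field, recorded_value="4")
```

Because the answer key comes straight from the recorded schema value, an item built this way needs no separate verification step, which is one plausible reason to pair it with the LLM-generated, report-verified question type.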

Zero-shot evaluation of six vision-language models (VLMs) finds no single dominant model and substantial headroom in every cohort and question-type combination. More importantly, a blind ablation, in which the image is removed and the model sees only the question, shows that visual reliance is highly dataset-specific. For liver report-derived questions, vision is genuinely required. For lung CT, however, blinded models achieved accuracy equal to or exceeding their sighted performance. In other words, keeping the data private controls for memorized cases but does not guarantee that a question actually measures visual capability.
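The blind ablation described above comes down to scoring the same questions twice, once with the image and once without, and comparing accuracy per cohort and question type. The sketch below shows that comparison under stated assumptions: the record format, the function names, and the toy outcomes are invented for illustration and are not results from the paper.

```python
from collections import defaultdict


def accuracy_by_cell(records):
    """Accuracy per (cohort, question type) cell from per-question outcomes."""
    totals = defaultdict(lambda: [0, 0])  # cell -> [n_correct, n_total]
    for r in records:
        cell = (r["cohort"], r["qtype"])
        totals[cell][0] += int(r["correct"])
        totals[cell][1] += 1
    return {cell: correct / total for cell, (correct, total) in totals.items()}


def visual_reliance(sighted_records, blind_records):
    """Sighted accuracy, blind accuracy, and their gap for each shared cell."""
    sighted = accuracy_by_cell(sighted_records)
    blind = accuracy_by_cell(blind_records)
    return {
        cell: {
            "sighted": sighted[cell],
            "blind": blind[cell],
            # A gap near zero or below means the image contributed nothing measurable.
            "gap": sighted[cell] - blind[cell],
        }
        for cell in sighted
        if cell in blind
    }


def make_records(cohort, qtype, outcomes):
    """Helper that builds toy records; the outcomes are made up."""
    return [{"cohort": cohort, "qtype": qtype, "correct": bool(o)} for o in outcomes]


# Toy outcomes only, chosen to mirror the qualitative pattern in the text.
sighted_run = make_records("liver", "report", [1, 1, 1, 0]) + make_records("lung", "report", [1, 0, 1, 0])
blind_run = make_records("liver", "report", [0, 1, 0, 0]) + make_records("lung", "report", [1, 1, 1, 0])

report = visual_reliance(sighted_run, blind_run)
for cell, stats in report.items():
    print(cell, stats)
```

A large positive gap suggests the questions in that cell depend on the image. A gap at or below zero, as in the toy lung cell, suggests they can be answered from the question text or from priors alone.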

The authors release the pipeline as an open agent skill for in-house redeployment. The findings underscore the need for more rigorous benchmarks that truly test visual reasoning in medical AI.

Key Points
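  • Generates two question types: RADS-style questions derived deterministically from reporting schemas, and report-derived questions written by an LLM and verified against the source report.
  • Applied to four in-house cancer cohorts (liver, lung, brain, prostate), producing a benchmark controlled for instance-level contamination without per-question human annotation.
  • On lung CT, blinded VLMs matched or exceeded their sighted accuracy, while liver report-derived questions genuinely required the image, so visual reliance is dataset-specific.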

Why It Matters

Benchmark scores can only inform clinical deployment decisions if they reflect what a model actually sees in the scan. Blind ablations like this one show where they do not, even on private data.