New Benchmark Reveals AI Vision Models Can 'Cheat' on 3D Cancer Scans

arXiv cs.CV June 03, 2026

⚡ On lung CT, models answered as well without the image as with it, showing that private data alone does not rule out contamination.

Deep Dive

A team led by Bo Liu proposes an automated agent-driven pipeline that generates multiple-choice visual question answering (VQA) datasets directly from paired private radiology reports and 3D oncology imaging. The pipeline produces two complementary question types: RADS-style questions derived deterministically from clinician-defined reporting schemas, and report-derived questions generated by an LLM from radiologist findings, then verified against the source report. Applied to four in-house cancer cohorts (liver, lung, brain, prostate), the benchmark is instance-contamination-controlled and requires no per-question human annotation.
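As a rough illustration of what "derived deterministically from a reporting schema" could look like, here is a minimal Python sketch. It is not the authors' pipeline: the function name `rads_question`, the `MCQ` container, the question wording, and the example field and category values are all assumptions made for illustration. The idea is only that a structured report field with a fixed set of allowed values can be turned into a multiple-choice question without any LLM involvement.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class MCQ:
    question: str
    options: List[str]
    answer_index: int


def rads_question(field_label: str,
                  allowed_values: Sequence[str],
                  reported_value: str,
                  seed: int) -> MCQ:
    """Build one multiple-choice question from a structured report field.

    Hypothetical helper: the schema format used by the paper is not
    described in this article.
    """
    if reported_value not in allowed_values:
        raise ValueError(
            f"{reported_value!r} is not a valid value for {field_label}"
        )
    options = list(allowed_values)
    # Seeded shuffle: the same inputs always give the same option order.
    random.Random(seed).shuffle(options)
    return MCQ(
        question=f"What is the {field_label} for this study?",
        options=options,
        answer_index=options.index(reported_value),
    )


# Example with made-up field values for a liver study.
q = rads_question(
    "LI-RADS category",
    ["LR-1", "LR-2", "LR-3", "LR-4", "LR-5"],
    "LR-4",
    seed=7,
)
```

Because the correct answer is read straight from the structured field, questions of this kind need no per-question verification step, in contrast to the LLM-generated, report-derived questions.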

Zero-shot evaluation of six vision-language models (VLMs) finds no dominant model and substantial headroom in every cohort and question-type combination. More importantly, a blind ablation, in which the image is withheld, shows that visual reliance is highly dataset-specific. On liver report-derived questions, vision is genuinely required; on Lung CT, however, blinded models matched or exceeded their sighted accuracy. Private clinical data alone therefore does not guarantee a contamination-controlled read of visual capability.
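The blind ablation can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the paper's evaluation code: the model is assumed to be any callable taking `(image, question, options)` and returning the index of its chosen option, and "blinding" is assumed to mean passing `None` in place of the image. The names `accuracy` and `visual_gain` are hypothetical.

```python
from typing import Any, Callable, List, Optional, Sequence, Tuple

# One benchmark item: (image, question, options, answer_index)
Item = Tuple[Any, str, Sequence[str], int]
# Assumed model interface: returns the index of the chosen option.
Model = Callable[[Optional[Any], str, Sequence[str]], int]


def accuracy(model: Model, items: List[Item], blind: bool) -> float:
    """Fraction of items answered correctly, with or without the image."""
    if not items:
        raise ValueError("items must be non-empty")
    correct = 0
    for image, question, options, answer_index in items:
        shown = None if blind else image  # blind run withholds the image
        if model(shown, question, options) == answer_index:
            correct += 1
    return correct / len(items)


def visual_gain(model: Model, items: List[Item]) -> float:
    """Sighted accuracy minus blinded accuracy on the same items."""
    return (accuracy(model, items, blind=False)
            - accuracy(model, items, blind=True))
```

A gain near zero or below zero on a dataset, as reported here for Lung CT, suggests the questions can be answered from text priors alone; a clearly positive gain, as on the liver report-derived questions, suggests the image is actually being used.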

The authors release the pipeline as an open agent skill for in-house redeployment. The findings underscore the need for more rigorous benchmarks that truly test visual reasoning in medical AI.

Key Points

Generates two question types: RADS-style questions derived from clinician-defined reporting schemas, and report-derived questions written by an LLM and verified against the source report.
Applied to four in-house cancer cohorts (liver, lung, brain, prostate), producing an instance-contamination-controlled benchmark with no per-question human annotation.
On Lung CT, blinded VLMs matched or exceeded their sighted accuracy, indicating that visual reliance is dataset-specific.

Why It Matters

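Checking whether models can answer without the image is a prerequisite for trustworthy evaluation of medical AI, and trustworthy evaluation is crucial before models are deployed in clinical diagnostics.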
