Research & Papers

VERT: Reliable LLM Judges for Radiology Report Evaluation

A new LLM-based metric outperforms prior judges, and fine-tuning on just 1,300 samples boosts correlation with radiologist judgments by up to 25%.

Deep Dive

A new research paper introduces VERT, a method for using large language models (LLMs) as reliable judges of radiology report quality. Developed by a team including Federica Bologna, VERT is benchmarked against existing LLM-as-a-judge metrics such as RadFact and GREEN on two expert-annotated datasets, RadEval and RaTE-Eval. VERT improves correlation with radiologist judgments by up to 11.7% relative to the previous best metric, a significant step forward in automated medical report assessment.
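LLM-as-a-judge metrics of this kind generally work by prompting a model to score a candidate report against a reference. The sketch below illustrates that pattern with a hypothetical prompt template and scoring rubric; it is not VERT's actual prompt, which the paper defines.

```python
def build_judge_prompt(reference: str, candidate: str) -> str:
    """Assemble a scoring prompt for an LLM judge (illustrative template only)."""
    return (
        "You are an expert radiologist evaluating a generated report.\n\n"
        "Reference report:\n"
        f"{reference}\n\n"
        "Candidate report:\n"
        f"{candidate}\n\n"
        "Score the candidate from 1 (poor) to 5 (clinically equivalent), "
        "penalizing missed or hallucinated findings. Reply with the score only."
    )

prompt = build_judge_prompt(
    "No acute cardiopulmonary abnormality.",
    "Lungs are clear. No acute findings.",
)
# The prompt is then sent to the judge model (e.g., a fine-tuned Qwen3 30B),
# and the returned score is compared against radiologist ratings.
print(prompt)
```

The key design choice in such metrics is the rubric: anchoring scores to clinical error types (missed findings, hallucinations) is what distinguishes them from generic text-overlap measures.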

Crucially, the research shows that lightweight adaptation through fine-tuning yields large gains: fine-tuning the open-source Qwen3 30B model on just 1,300 training samples from the RaTE-Eval dataset improved performance by up to 25%. This parameter-efficient approach also delivered a 37.2x reduction in inference time, making high-quality automated evaluation far more practical. The study also performed a systematic error analysis to understand where these AI judges align with or diverge from human experts, providing a clearer roadmap for future improvements in this critical domain.
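Agreement between an automated judge and human experts is typically quantified with a rank correlation such as Kendall's tau. As an illustrative sketch (not the paper's exact protocol), comparing a judge's scores against radiologist ratings might look like:

```python
from itertools import combinations

def kendall_tau(metric_scores, expert_scores):
    """Kendall's tau-a: fraction of concordant minus discordant ranked pairs."""
    assert len(metric_scores) == len(expert_scores)
    concordant = discordant = 0
    for i, j in combinations(range(len(metric_scores)), 2):
        # A pair is concordant if both score lists order items i and j the same way.
        d = (metric_scores[i] - metric_scores[j]) * (expert_scores[i] - expert_scores[j])
        if d > 0:
            concordant += 1
        elif d < 0:
            discordant += 1
    n_pairs = len(metric_scores) * (len(metric_scores) - 1) / 2
    return (concordant - discordant) / n_pairs

# Hypothetical scores: automated judge (0-1) vs. radiologist ratings (1-5).
judge = [0.9, 0.4, 0.7, 0.2]
radiologist = [5, 2, 4, 1]
print(kendall_tau(judge, radiologist))  # same ranking in both lists → 1.0
```

A higher tau means the judge ranks reports more like a radiologist would; the percentage improvements reported here refer to gains in this kind of agreement statistic.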

Key Points
  • VERT, a new LLM-based evaluation metric, improves correlation with radiologist judgments by up to 11.7% over the previous best method (GREEN).
  • Fine-tuning the Qwen3 30B model with only 1,300 samples boosted performance by up to 25% and reduced inference time by 37.2 times.
  • The study validates LLMs as reliable judges for radiology reports across multiple imaging modalities and body anatomies, not just chest X-rays.

Why It Matters

Enables faster, scalable, and more consistent quality assurance for critical medical documentation, aiding radiologists and healthcare systems.