AI Safety

A Scoping Review of LLM-as-a-Judge in Healthcare and the MedJUDGE Framework

73.5% of studies skip bias testing; MedJUDGE proposes a safety fix.

Deep Dive

A new scoping review published on arXiv (April 3, 2026) by Chenyu Li and 15 co-authors from multiple institutions systematically examined the use of LLM-as-a-Judge (LaaJ) in healthcare. After screening 11,727 studies across six databases from January 2020 to January 2026, the team included 49 studies for analysis. The review found that LaaJ adoption is dominated by evaluation and benchmarking applications (75.5% of studies), pointwise scoring (85.7%), and GPT-family models as judges (73.5%). Despite growing use, validation rigor is alarmingly low: among the 36 studies with human involvement, the median number of expert validators was just 3, and the remaining 13 studies (26.5% of all 49) used no human validators at all. Risk-of-bias testing was absent in 73.5% of studies, only one study examined demographic fairness, and none assessed temporal stability or patient context. Only one study (2%) reached production, with four more (8.2%) at the prototype stage.
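Pointwise scoring, the dominant pattern the review identifies, means the judge model rates each answer in isolation against a rubric rather than comparing candidates head-to-head. A minimal sketch of that pattern follows; the prompt wording, the 1-5 scale, and the reply format are illustrative assumptions, not taken from any study in the review.

```python
import re

# Illustrative 1-5 rubric prompt for pointwise LLM-as-a-Judge scoring.
# The rubric text and "Score: <n>" reply convention are assumptions
# made for this sketch, not the review's actual protocol.
JUDGE_PROMPT = """You are a clinical evaluator.
Rate the following model answer for factual accuracy on a 1-5 scale.

Question: {question}
Answer: {answer}

Reply with a single line: Score: <1-5>"""


def build_judge_prompt(question: str, answer: str) -> str:
    """Fill the rubric template for one (question, answer) pair."""
    return JUDGE_PROMPT.format(question=question, answer=answer)


def parse_score(judge_reply: str) -> int:
    """Extract the integer score from the judge model's text reply."""
    match = re.search(r"Score:\s*([1-5])", judge_reply)
    if match is None:
        raise ValueError(f"Unparseable judge reply: {judge_reply!r}")
    return int(match.group(1))
```

In practice the filled prompt would be sent to the judge model (most often a GPT-family model, per the review) and the free-text reply parsed back into a score; the parsing step is itself a known failure point when judge replies drift from the requested format.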

To address these critical gaps, the researchers propose MedJUDGE (Medical Judge Utility, De-biasing, Governance and Evaluation), a risk-stratified framework organized around three pillars, validity, safety, and accountability, applied across clinical risk tiers. The framework aims to provide deployment-oriented evaluation guidance for healthcare LaaJ systems. In particular, it targets a governance gap: when a judge shares training data or architecture with the system it evaluates, both can inherit the same blind spots, so high agreement metrics can mask clinically significant errors. MedJUDGE offers a structured path toward safer, fairer, and more reliable LLM evaluation in clinical settings.
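The point about agreement metrics masking clinically significant errors can be made concrete with a toy simulation. The numbers below are invented for illustration only: dangerous errors are rare, so a judge can agree with an expert on nearly every item overall while still missing most of the cases that matter clinically.

```python
# Hypothetical simulation of judge-vs-expert labels on 100 answers.
# Expert labels: 1 = answer contains a clinically dangerous error,
# 0 = acceptable. Dangerous errors are rare (5 of 100).
expert = [1] * 5 + [0] * 95

# Judge labels: the judge flags only 1 of the 5 dangerous errors but
# matches the expert on every acceptable answer.
judge = [1] + [0] * 4 + [0] * 95

# Overall agreement looks reassuring...
agreement = sum(e == j for e, j in zip(expert, judge)) / len(expert)

# ...but recall on the dangerous subset is the clinically relevant number.
dangerous_recall = (
    sum(e == 1 and j == 1 for e, j in zip(expert, judge))
    / sum(e == 1 for e in expert)
)

print(f"Overall agreement: {agreement:.0%}")            # 96%
print(f"Recall on dangerous errors: {dangerous_recall:.0%}")  # 20%
```

A 96% agreement figure would pass most reported validation bars, yet this judge catches only one in five dangerous errors, which is exactly the kind of gap a risk-stratified evaluation is meant to surface.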

Key Points
  • 73.5% of 49 reviewed studies lack risk-of-bias testing; median expert validators is just 3
  • GPT-family models serve as judges in 73.5% of studies, creating model monoculture risks
  • Only 1 of 49 LaaJ systems (2%) reached production; MedJUDGE proposes a three-pillar, risk-stratified safety evaluation

Why It Matters

Without rigorous validation, LLM-as-a-Judge in healthcare risks missing dangerous errors. MedJUDGE offers a safety roadmap.