Research & Papers

Using LLM-as-a-Judge/Jury to Advance Scalable, Clinically-Validated Safety Evaluations of Model Responses to Users Demonstrating Psychosis

New method uses AI judges to evaluate chatbot safety for users experiencing psychosis, reaching substantial agreement with human experts (Cohen's κ = 0.75).

Deep Dive

A research team including May Lynn Reese, Markela Zeneli, and Mindy Ng has published a paper introducing a scalable method for evaluating the safety of Large Language Model (LLM) responses to users experiencing psychosis. The work addresses a critical gap: as people increasingly turn to general-purpose chatbots like ChatGPT for mental health support, there is emerging evidence of significant risks, particularly the potential for AI to reinforce delusions and hallucinations. Existing evaluations lack both clinical validation and scalability.

To close this gap, the researchers first developed and validated seven safety criteria informed by clinical expertise. They then constructed a human-consensus dataset to serve as a gold standard. Their core contribution is automated assessment using an 'LLM-as-a-Judge' (a single model evaluator) and an 'LLM-as-a-Jury' (a majority vote across several model evaluators), with models including Gemini, Qwen, and Kimi tested in the judge role.
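
To make the two setups concrete, here is a minimal sketch of how a majority-vote jury could be wired together. The model names, evaluation prompt, and `ask` callable are illustrative assumptions, not the authors' actual harness or prompts.

```python
# Minimal sketch of the judge-vs-jury setup described above.
# The model names, prompt, and `ask` callable are illustrative stand-ins.
from collections import Counter
from typing import Callable

def jury_verdict(
    models: list[str],
    prompt: str,
    ask: Callable[[str, str], str],
) -> str:
    """LLM-as-a-Jury: collect one verdict per judge model for the same
    evaluation prompt and return the majority-vote label."""
    votes = [ask(model, prompt) for model in models]
    return Counter(votes).most_common(1)[0][0]

# Toy demonstration with a stubbed-out judge call (no real API requests).
if __name__ == "__main__":
    fake_votes = {"gemini": "unsafe", "qwen": "unsafe", "kimi": "safe"}
    verdict = jury_verdict(
        models=list(fake_votes),
        prompt="Does this chatbot response reinforce the user's delusion?",
        ask=lambda model, _prompt: fake_votes[model],
    )
    print(verdict)  # -> "unsafe" (2 of 3 judges agree)
```

A single-model judge is simply the same call with one model and no vote; the jury trades extra inference cost for robustness to any one judge's idiosyncrasies.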

The results, presented at IASEAI 2026, show strong alignment between the AI judges and human experts. The best-performing single LLM judge (Gemini) achieved a Cohen's kappa (κ) of 0.75 against the human consensus, indicating substantial agreement; the LLM-as-a-Jury approach scored slightly lower at κ = 0.74. This suggests that automated evaluation can closely match expert human judgment, an important step toward making rigorous safety testing feasible at scale.
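
For context, Cohen's kappa measures agreement after correcting for the agreement two raters would reach by chance, so κ = 0.75 is not the same thing as 75% raw agreement. A minimal, self-contained computation on made-up labels (not the paper's data) is sketched below.

```python
# Cohen's kappa: observed agreement corrected for chance agreement.
# The label lists are made-up for illustration only.
from collections import Counter

def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    n = len(rater_a)
    # Observed agreement: fraction of items where the two raters match.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters independently pick the same label.
    p_e = sum((freq_a[label] / n) * (freq_b[label] / n)
              for label in set(rater_a) | set(rater_b))
    return (p_o - p_e) / (1 - p_e)

human = ["unsafe", "unsafe", "safe", "safe", "unsafe", "safe"]
judge = ["unsafe", "unsafe", "safe", "unsafe", "unsafe", "safe"]
print(round(cohens_kappa(human, judge), 2))  # 0.67 for this toy example
```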

This research provides a foundational framework for the AI industry to systematically test and mitigate harms in sensitive domains like mental health. It moves beyond simple content moderation to clinically-grounded assessment, enabling developers to proactively identify dangerous response patterns before models are deployed to vulnerable populations.

Key Points
  • Developed 7 clinician-informed safety criteria to evaluate AI responses to psychosis, moving beyond generic content filters.
  • Best LLM judge (Gemini) achieved substantial agreement with the human expert consensus on response safety (Cohen's κ = 0.75).
  • LLM-as-a-Jury (majority vote of several models) performed slightly worse (κ=0.74) than the best single judge.

Why It Matters

Enables scalable, clinical safety testing for AI mental health tools, crucial for protecting vulnerable users from harmful interactions.