Research & Papers

TherapyGym: Evaluating and Aligning Clinical Fidelity and Safety in Therapy Chatbots

New framework uses clinical scales and RL to boost therapy chatbot fidelity scores from 0.10 to 0.60.

Deep Dive

A research team from Stanford and collaborating institutions has published a new paper introducing TherapyGym, a comprehensive framework designed to solve a critical problem in AI for mental health: current evaluation methods fail to assess what matters clinically. The system evaluates therapy chatbots along two core pillars: fidelity to evidence-based practice (such as Cognitive Behavioral Therapy) and safety against therapy-specific risks (such as failing to address abuse). To measure fidelity, it automates the established Cognitive Therapy Rating Scale (CTRS), scoring a chatbot's adherence to core therapeutic techniques across multi-turn conversations.
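To make the evaluation idea concrete, here is a minimal sketch of item-level CTRS scoring over a dialogue. The item names, the `Turn` structure, and the `score_item` stub are all illustrative assumptions, not the paper's implementation; in practice the per-item rater would be an AI judge, and the real CTRS has eleven items rated 0 to 6 by trained raters.

```python
from dataclasses import dataclass

# Illustrative subset of CTRS items (assumption: the full scale has 11 items,
# each rated 0-6; we normalize the average to [0, 1]).
CTRS_ITEMS = ["agenda", "feedback", "guided_discovery", "homework"]

@dataclass
class Turn:
    role: str   # "therapist" or "patient"
    text: str

def score_item(dialogue: list[Turn], item: str) -> int:
    """Hypothetical stand-in for an automated rater (e.g. an LLM judge)
    that would assign a 0-6 CTRS rating for one item over the dialogue.
    Returns a fixed placeholder score here."""
    return 3

def ctrs_fidelity(dialogue: list[Turn]) -> float:
    """Average item score, normalized to [0, 1]."""
    scores = [score_item(dialogue, item) for item in CTRS_ITEMS]
    return sum(scores) / (6 * len(scores))

dialogue = [Turn("patient", "I feel like nothing I do matters."),
            Turn("therapist", "Let's set an agenda for today.")]
print(ctrs_fidelity(dialogue))  # 0.5 with the placeholder rater
```

A normalized score like this maps naturally onto the 0.10 to 0.60 range the article reports.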

Beyond evaluation, TherapyGym serves as a training harness. It uses CTRS and safety scores as rewards in a reinforcement learning (RL) loop, where AI models practice with configurable patient simulations exhibiting diverse symptoms. This training produced significant improvements: models boosted their average expert-rated CTRS score from 0.10 to 0.60. To combat bias in AI-based scoring, the team also released TherapyJudgeBench, a validation set of 116 dialogues with 1,270 ratings from licensed clinicians, for calibrating automated judges against human experts.
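The RL loop described above can be sketched as a reward that combines fidelity with a safety penalty, rolled out against simulated patient profiles. Everything here is an assumption for illustration: the profile names, the `run_episode` stand-in (which fakes a rollout and improves with training), and the specific penalty shaping are not taken from the paper.

```python
import random

random.seed(0)

# Hypothetical symptom configurations for the patient simulator.
PATIENT_PROFILES = ["depression", "anxiety", "grief"]

def run_episode(policy_version: int, profile: str) -> tuple[float, bool]:
    """Stand-in for rolling out the chatbot against a simulated patient.
    Returns (fidelity in [0, 1], safety_ok); both improve with training
    in this toy model."""
    fidelity = min(1.0, 0.1 + 0.05 * policy_version + random.uniform(0, 0.05))
    safety_ok = random.random() > 0.2 / (1 + policy_version)
    return fidelity, safety_ok

def reward(fidelity: float, safety_ok: bool, penalty: float = 1.0) -> float:
    # Fidelity is the positive signal; a safety violation subtracts a penalty,
    # so unsafe behavior is never worth a high fidelity score.
    return fidelity if safety_ok else fidelity - penalty

for step in range(3):
    profile = random.choice(PATIENT_PROFILES)
    f, ok = run_episode(step, profile)
    print(f"step={step} profile={profile} reward={reward(f, ok):.2f}")
```

Making the safety penalty dominate the fidelity term ensures the policy cannot trade a safety violation for a stylistic gain.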

The work represents a major shift from evaluating chatbots on generic fluency to ensuring they are clinically competent and safe. By providing scalable, automated tools grounded in real therapeutic practice, TherapyGym enables the development of AI mental health support tools that professionals can trust, paving the way for more responsible deployment in a high-stakes domain.

Key Points
  • TherapyGym introduces automated scoring using the clinical Cognitive Therapy Rating Scale (CTRS), moving beyond generic chat metrics.
  • Models trained with TherapyGym's RL framework improved average CTRS fidelity scores from 0.10 to 0.60 based on expert evaluation.
  • Includes TherapyJudgeBench, a 116-dialogue validation set with 1,270 clinician ratings to audit and calibrate AI judges against human experts.
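Auditing an AI judge against clinician ratings, as TherapyJudgeBench enables, amounts to computing agreement statistics over paired scores. The sketch below uses made-up numbers and simple Pearson correlation plus mean absolute error; the paper's actual calibration metrics are not specified here.

```python
# Paired normalized scores on the same dialogues (illustrative data only).
judge     = [0.2, 0.5, 0.4, 0.7, 0.3]
clinician = [0.3, 0.5, 0.5, 0.6, 0.2]

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Mean absolute error: how far the judge drifts from clinicians on average.
mae = sum(abs(x - y) for x, y in zip(judge, clinician)) / len(judge)
print(f"pearson={pearson(judge, clinician):.2f}  MAE={mae:.2f}")
```

High correlation with low absolute error would justify trusting the automated judge as a scalable proxy for expert raters.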

Why It Matters

Provides the first scalable, clinical-grade framework to build AI therapy tools that are both evidence-based and safer, addressing a major trust gap.