Evaluating Text-based Conversational Agents for Mental Health: A Systematic Review of Metrics, Methods and Usage Contexts
Analysis of 132 studies reveals fragmented evaluation methods and reliance on Western-centric scales.
A major systematic review published on arXiv exposes critical gaps in how AI-powered mental health chatbots are evaluated. Conducted by researchers including Jiangtao Gong, the PRISMA-guided review analyzed 132 studies screened from a pool of 613 records, with dual-coder extraction achieving substantial agreement (Cohen's kappa = 0.77-0.92). The findings, synthesized across metrics, methods, and usage contexts, point to a fragmented evaluation landscape.
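Cohen's kappa, the agreement statistic cited above, corrects raw coder agreement for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement rate and p_e the rate expected from each coder's label frequencies. A minimal sketch, using hypothetical include/exclude labels (not the review's actual data):

```python
from collections import Counter

def cohen_kappa(coder_a, coder_b):
    """Inter-rater agreement corrected for chance (Cohen's kappa)."""
    assert len(coder_a) == len(coder_b) and coder_a
    n = len(coder_a)
    # Observed agreement: fraction of items both coders labeled identically.
    p_o = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Expected chance agreement, from each coder's marginal label frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
    return (p_o - p_e) / (1 - p_e)

# Hypothetical dual-coder screening decisions for ten studies (illustrative only).
a = ["include", "include", "exclude", "include", "exclude",
     "include", "exclude", "exclude", "include", "include"]
b = ["include", "include", "exclude", "include", "include",
     "include", "exclude", "exclude", "include", "include"]
print(round(cohen_kappa(a, b), 3))  # → 0.783
```

Here the coders agree on 9 of 10 items (p_o = 0.9), but because both label most items "include", chance agreement is high (p_e = 0.54), so kappa lands at 0.783 — squarely in the "substantial agreement" band the review reports.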
The technical analysis shows heavy reliance on Western-developed psychometric scales with limited cultural adaptation, raising questions about global applicability. Studies predominantly used small samples and short-term assessments, and only weak links were established between automated performance metrics (such as response reliability) and actual user well-being outcomes. The review classified metrics into conversational agent (CA)-centric attributes (e.g., safety, empathy) and user-centric outcomes (e.g., psychological state, health behavior), noting the need for better alignment between the two.
In context, this review arrives as tools like Woebot, Wysa, and features in platforms like ChatGPT see increased use for mental wellness. The paper argues that this ad-hoc evaluation landscape risks deploying ineffective or even harmful agents. The practical implication is a call for the industry to adopt methodological triangulation (combining automated analysis, standardized scales, and qualitative inquiry), along with longer-term studies and equity-focused measurement, to build safe and effective mental health AI.
- Analysis of 132 studies found reliance on Western psychometric scales with limited cultural adaptation.
- Research highlights a predominance of small samples and short-term assessments, with weak links between AI performance metrics and user outcomes.
- Calls for methodological triangulation and equity in measurement to improve evaluation rigor for mental health chatbots.
Why It Matters
As AI chatbots become frontline mental health tools, rigorous, culturally aware evaluation is critical for safety and efficacy.