Research & Papers

Language Shapes Mental Health Evaluations in Large Language Models

New research reveals LLMs produce systematically different mental health evaluations based on prompt language.

Deep Dive

A new study by researchers Jiayi Xu and Xiyang Hu finds that large language models (LLMs) such as OpenAI's GPT-4o and Alibaba's Qwen3 produce systematically different mental health evaluations based solely on the language of the prompt. The research, published on arXiv, demonstrates that when assessing mental health stigma using validated scales, both models consistently generated responses indicating higher levels of social, self, and professional stigma when the prompt was written in Chinese than when identical prompts were written in English.
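To make the comparison concrete, here is a minimal sketch of what such a bilingual probe could look like. It is illustrative only: the item wording, its Chinese translation, the single-item setup, and the response parsing are assumptions for demonstration, not the validated scales or protocol used in the study.

```python
# Illustrative sketch (not the study's actual protocol): pose the same
# Likert-style stigma item to a chat model in English and in Chinese,
# then compare the numeric ratings it returns.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical paired item; the study used validated stigma scales,
# whose exact wording is not reproduced here.
ITEM = {
    "en": "Rate from 1 (strongly disagree) to 5 (strongly agree): "
          "'Most people would think less of a person who has received "
          "mental health treatment.' Reply with a single number.",
    "zh": "请从1（非常不同意）到5（非常同意）打分："
          "“大多数人会看不起接受过心理健康治疗的人。”只回复一个数字。",
}

def rate(prompt: str, model: str = "gpt-4o") -> int:
    """Ask the model for a single Likert rating and parse the first digit."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    # Assumes the model complies and answers with a single digit.
    return int(resp.choices[0].message.content.strip()[0])

scores = {lang: rate(text) for lang, text in ITEM.items()}
print(scores)  # e.g. {'en': 2, 'zh': 4} would indicate a language gap
```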

These evaluative differences translated into concrete impacts on downstream decision-making tasks. In a binary task designed to detect stigmatizing content, the models showed lower sensitivity—meaning they were worse at correctly identifying stigma—when operating under Chinese prompts. More critically, in a depression severity classification task, the models made more underestimation errors with Chinese prompts, systematically predicting lower severity levels than they did for the same cases presented in English. This suggests the language context isn't just changing wording; it's shifting the models' internal decision thresholds.
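As a worked illustration of the two metrics described above, the sketch below computes sensitivity (recall on the stigmatizing class) for the detection task and an underestimation rate for the ordinal severity task. The labels are invented for demonstration and do not come from the paper.

```python
# Sketch of the two downstream metrics, with made-up data:
# sensitivity (recall on the "stigmatizing" class) for binary detection,
# and the underestimation rate for ordinal severity classification.
from sklearn.metrics import recall_score

# Hypothetical binary labels: 1 = stigmatizing content, 0 = not.
y_true  = [1, 1, 1, 0, 1, 0, 1, 0]
pred_en = [1, 1, 1, 0, 1, 0, 0, 0]   # misses one positive
pred_zh = [1, 0, 1, 0, 0, 0, 0, 0]   # misses three positives

print("sensitivity (en):", recall_score(y_true, pred_en))  # 0.8
print("sensitivity (zh):", recall_score(y_true, pred_zh))  # 0.4

# Hypothetical ordinal severity levels (0 = minimal ... 3 = severe).
sev_true = [3, 2, 2, 1, 3]
sev_zh   = [2, 2, 1, 1, 2]  # predictions under Chinese prompts

# Fraction of cases where the predicted severity falls below the true level.
under_rate = sum(p < t for p, t in zip(sev_zh, sev_true)) / len(sev_true)
print("underestimation rate (zh):", under_rate)  # 0.6
```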

The findings point to a significant, embedded bias where the cultural and linguistic data a model is trained on directly influences its evaluative outputs in sensitive domains. This isn't merely a translation issue but a fundamental difference in how the models process and weigh information based on linguistic cues. For global applications in healthcare and psychology, this variability undermines the reliability and fairness of AI-assisted diagnostics and support, raising urgent questions about calibration and deployment standards.

Key Points
  • GPT-4o and Qwen3 showed higher mental health stigma scores across all measured scales (social, self, professional) when prompted in Chinese vs. English.
  • In a stigma detection task, model sensitivity was lower under Chinese prompts, reducing accuracy in identifying harmful content.
  • For depression severity classification, Chinese prompts led to systematic underestimation errors, shifting predicted outcomes downward.

Why It Matters

This linguistic bias compromises the reliability of AI in global mental health applications, risking inconsistent care and diagnostic outcomes.