Agent Frameworks

LingxiDiagBench: A Multi-Agent Framework for Benchmarking LLMs in Chinese Psychiatric Consultation and Diagnosis

A massive new test shows AI is still terrible at real-world mental health diagnosis.

Deep Dive

Researchers released LingxiDiagBench, a multi-agent framework for benchmarking LLMs on Chinese psychiatric consultation and diagnosis. At its core is a dataset of 16,000 synthetic patient dialogues spanning 12 mental health categories. Models achieve 92.3% accuracy on simple binary classification but crash to just 28.5% on the full 12-way differential diagnosis. The benchmark also shows that dynamic, multi-turn patient consultation often hurts performance, and that fluent questioning doesn't guarantee a correct diagnosis.
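To put those two accuracy numbers on comparable footing, it helps to compare each against its random-guessing baseline (50% for binary, about 8.3% for 12-way). The sketch below is our own illustration, not part of the benchmark; only the 92.3% and 28.5% figures come from the paper.

```python
# Hypothetical sketch: compare the reported accuracies to chance baselines.
# The accuracies (92.3% binary, 28.5% 12-way) are from LingxiDiagBench;
# the "lift over chance" framing is illustrative, not the paper's metric.

def lift_over_chance(accuracy: float, num_classes: int) -> float:
    """Ratio of observed accuracy to uniform random-guessing accuracy."""
    chance = 1.0 / num_classes
    return accuracy / chance

binary_lift = lift_over_chance(0.923, 2)       # chance = 50.0%
twelve_way_lift = lift_over_chance(0.285, 12)  # chance ~= 8.3%

print(f"binary:  {binary_lift:.2f}x chance")   # ~1.85x
print(f"12-way:  {twelve_way_lift:.2f}x chance")  # ~3.42x
```

Note the nuance: 28.5% is actually a larger lift over chance than 92.3% is, but a diagnosis that is wrong nearly three times out of four is still far from clinically usable.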

Why It Matters

This exposes a critical gap in AI's ability to handle nuanced, real-world clinical reasoning, tempering hype around near-term AI therapists.