One-shot emergency psychiatric triage across 15 frontier AI chatbots
New study reveals 94.3% accuracy for emergencies, but only 19.7% for cases needing assessment within 1 week
A new study published on arXiv (2604.25415) tested 15 frontier AI chatbots on one-shot psychiatric triage using 112 realistic clinical vignettes. The vignettes covered 9 psychiatric presentation clusters and 9 risk dimensions, and each was paired with one of four triage labels: routine (A), assessment within 1 week (B), assessment within 24-48 hours (C), or emergency care now (D). The chatbots had to assign a triage label based solely on a single user message. Emergency detection was strong: accuracy for level D was 94.3%, and the 5.6% of emergencies that were under-triaged were all still assigned the next-highest urgency (level C). However, accuracy dropped sharply for intermediate cases, particularly level B (19.7%). The mean signed ordinal error was +0.47, indicating net over-triage: on average, chatbots overestimated risk for low and moderate presentations.
These findings were validated against clinician consensus from 50 medical doctors, which confirmed the trend. The study underscores that frontier AI chatbots can reliably recognize psychiatric emergencies requiring immediate care, but that they over-triage routine and intermediate cases. The authors note that while AI shows promise for crisis detection, this tendency to over-flag non-urgent cases could strain healthcare resources with false alarms. The research highlights the need for calibration in AI-based triage tools, especially in psychiatric contexts, where urgency is inferred from subjective cues rather than objective metrics.
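To make the reported metrics concrete, here is a minimal sketch of how per-level accuracy, under-triage rate, and mean signed ordinal error could be computed. It assumes the four labels are mapped to evenly spaced integers (A=0 through D=3) and that error is taken as predicted minus reference, so positive values mean over-triage, which matches the sign of the reported +0.47. The function names and the toy data are hypothetical and are not taken from the study.

```python
from collections import defaultdict

# Assumed ordinal encoding of the four triage labels (unit spacing is an assumption).
LEVELS = {"A": 0, "B": 1, "C": 2, "D": 3}


def mean_signed_ordinal_error(reference, predicted):
    """Average of (predicted - reference) on the ordinal scale.

    Positive values indicate net over-triage, negative values net under-triage.
    """
    if not reference or len(reference) != len(predicted):
        raise ValueError("reference and predicted must be non-empty and equal in length")
    diffs = [LEVELS[p] - LEVELS[r] for r, p in zip(reference, predicted)]
    return sum(diffs) / len(diffs)


def per_level_accuracy(reference, predicted):
    """Fraction of exact matches, grouped by the reference label."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r, p in zip(reference, predicted):
        totals[r] += 1
        hits[r] += int(p == r)
    return {level: hits[level] / totals[level] for level in totals}


def under_triage_rate(reference, predicted, level):
    """Share of cases at `level` that were assigned a lower urgency."""
    preds = [p for r, p in zip(reference, predicted) if r == level]
    if not preds:
        raise ValueError(f"no reference cases at level {level}")
    return sum(LEVELS[p] < LEVELS[level] for p in preds) / len(preds)


# Toy data for illustration only; these are NOT the study's vignettes or results.
reference = ["A", "A", "B", "B", "C", "D", "D", "D"]
predicted = ["B", "A", "C", "C", "C", "D", "C", "D"]

print(mean_signed_ordinal_error(reference, predicted))  # 0.25 (net over-triage)
print(per_level_accuracy(reference, predicted))
print(under_triage_rate(reference, predicted, "D"))
```

In this toy set, every level B case is pushed up to C while one emergency slips down to C, so the signed error stays positive even though an under-triage occurred. This is why a signed mean is best read alongside per-level rates rather than on its own.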
- 15 frontier AI chatbots evaluated on 112 psychiatric vignettes with 4 triage levels (A-D)
- 94.3% accuracy for emergency cases (level D), but only 19.7% for cases needing assessment within 1 week (level B)
- 5.6% emergency under-triage rate, with all misclassified emergencies still assigned to high urgency (level C)
Why It Matters
AI chatbots show promise for psychiatric crisis detection but need calibration to avoid over-triage and false alarms.