When Consistency Becomes Bias: Interviewer Effects in Semi-Structured Clinical Interviews
Models achieve high scores by memorizing interviewer prompts rather than analyzing patient language, inflating reported performance.
A new study by Hasindri Watawana, Sergio Burdisso, and colleagues exposes a critical flaw in AI models designed to detect depression from clinical conversations. Published on arXiv and accepted to LREC 2026, the paper analyzes three major datasets used to train these diagnostic tools: ANDROIDS, DAIC-WOZ, and E-DAIC. The researchers found that models were not learning from genuine patient linguistic cues but were instead exploiting artifacts of the semi-structured interview format. Specifically, the models learned to associate certain fixed interviewer prompts, or their position in the script, with a depression label, achieving high performance by essentially memorizing the test's structure rather than understanding the patient.
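To make the shortcut concrete, the sketch below shows one way such leakage could be probed; it is an illustration, not the paper's method. If a trivial classifier that sees only the interviewer's turns predicts depression labels well above chance, the labels leak through the interview structure itself. The transcript fields (`speaker`, `text`) and the `"interviewer"` tag are assumptions about the data format.

```python
# Illustrative shortcut probe (not the paper's code): check whether a trivial
# classifier that only sees INTERVIEWER turns can predict the depression label.
# The transcript format (speaker/text keys, "interviewer" tag) is an assumption.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline


def interviewer_only_text(transcript):
    """Concatenate only the interviewer's turns from one interview."""
    return " ".join(t["text"] for t in transcript if t["speaker"] == "interviewer")


def probe_interviewer_shortcut(transcripts, labels, folds=5):
    """Return mean cross-validated F1 of a probe that never sees the patient.

    A score well above chance means the interview structure leaks the label.
    """
    texts = [interviewer_only_text(t) for t in transcripts]
    probe = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),
        LogisticRegression(max_iter=1000),
    )
    return cross_val_score(probe, texts, labels, cv=folds, scoring="f1").mean()
```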
This prompt-memorization bias is architecture-agnostic, appearing across different model types, and cross-dataset, indicating a problem that spans the field. When the team restricted models to analyze only the participant's utterances, the decision evidence became more broadly distributed and more reflective of actual language use, though performance often dropped. The findings show how the very consistency provided by semi-structured protocols, intended to standardize assessments, can inadvertently create shortcuts that inflate reported accuracy. The work underscores the need for analysis techniques that localize decision evidence by both time and speaker, so that models demonstrably learn from meaningful clinical signals rather than procedural artifacts.
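As a rough illustration of what those two remedies can look like in practice, the sketch below filters an interview down to participant-only turns and aggregates token-level attribution scores by speaker. The transcript format, speaker tags, and the source of the attribution scores are assumptions for the example, not the authors' implementation.

```python
# Minimal sketch (assumed transcript format) of the two mitigations described above:
# (1) restrict model input to the participant's own utterances, and
# (2) localize decision evidence by speaker, e.g. by summing per-token attributions.
from collections import defaultdict


def participant_only_text(transcript, participant_tag="participant"):
    """Keep only the participant's turns, dropping interviewer prompts entirely."""
    return " ".join(t["text"] for t in transcript if t["speaker"] == participant_tag)


def attribution_by_speaker(tokens, attributions, token_speakers):
    """Return the share of absolute attribution mass assigned to each speaker.

    `attributions` could come from any token-level method (e.g. gradient x input);
    a model that concentrates most of its mass on interviewer tokens is relying
    on the procedural shortcut rather than patient language.
    """
    totals = defaultdict(float)
    for token, score, speaker in zip(tokens, attributions, token_speakers):
        totals[speaker] += abs(score)
    mass = sum(totals.values()) or 1.0
    return {speaker: value / mass for speaker, value in totals.items()}
```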
- AI models trained on datasets like DAIC-WOZ cheated by learning fixed interviewer prompts, not patient speech patterns.
- The bias is cross-dataset and architecture-agnostic, calling into question the validity of many published high-performance results.
- Restricting models to patient utterances forces them to use broader linguistic evidence, though it may lower scores.
Why It Matters
The study reveals fundamental validity issues in AI mental health tools and pushes the field toward more rigorous, interpretable model evaluation.