LLMs in social services: How does chatbot accuracy affect human accuracy?
High-accuracy AI suggestions improved caseworker performance, but incorrect suggestions caused a two-thirds drop in accuracy on easy questions.
A new study from researchers at Stanford and Cornell, led by Jennah Gosciak, investigates a critical real-world question: how do AI chatbot suggestions affect the accuracy of human experts? The team focused on social service caseworkers who navigate complex programs like SNAP (food stamps). They built a rigorous 770-question benchmark of difficult, realistic eligibility scenarios and conducted a randomized experiment with caseworkers from Los Angeles nonprofits.
Caseworkers without AI assistance had a baseline accuracy of 49%. When given high-quality chatbot suggestions (96-100% accurate), their accuracy rose by 27 percentage points, to about 76%. However, the study uncovered a major risk: incorrect AI suggestions caused a two-thirds reduction in caseworker accuracy on easy questions, meaning those the control group handled well. The research also identified an 'AI underreliance plateau,' where human performance levels off despite further improvements in AI accuracy, showing that even near-perfect AI doesn't guarantee perfect human-AI team outcomes.
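The two headline figures are different kinds of change: the 27-point gain is an absolute change in percentage points, while the two-thirds reduction is a relative change. The short Python sketch below works through both. The 49% baseline and 27-point gain come from the study as reported here; the 90% easy-question baseline is a hypothetical value chosen only for illustration, and the function names are likewise illustrative.

```python
from fractions import Fraction


def add_points(accuracy_pct, gain_points):
    """Absolute change: add percentage points to an accuracy given in percent."""
    return accuracy_pct + gain_points


def apply_relative_drop(accuracy_pct, drop_fraction):
    """Relative change: remove a fraction of the current accuracy."""
    return accuracy_pct * (1 - drop_fraction)


# Figures reported in the article: 49% baseline, +27 percentage points.
with_good_ai = add_points(49, 27)

# HYPOTHETICAL easy-question baseline of 90% (not a figure from the study),
# used only to show what a two-thirds relative reduction looks like.
hypothetical_easy_baseline = Fraction(90)
with_bad_ai = apply_relative_drop(hypothetical_easy_baseline, Fraction(2, 3))

print(f"With high-accuracy AI: {with_good_ai}%")
print(f"Easy questions after incorrect AI (hypothetical baseline): {with_bad_ai}%")
```

Under that assumed baseline, a two-thirds reduction would leave caseworkers well below where they started without any AI help, which is the sense in which an incorrect suggestion can be worse than none.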
This work moves beyond standard AI benchmarks to measure the combined human-in-the-loop system rather than the model alone. The findings have immediate implications for deploying LLMs in high-stakes domains like healthcare, legal aid, and government services, where an AI's error can directly amplify human error. The study underscores that evaluating AI tools requires testing them with their intended users in realistic scenarios.
- High-accuracy AI (96-100%) boosted caseworker performance by 27 percentage points from a 49% baseline.
- Incorrect AI suggestions caused a two-thirds reduction in human accuracy on easy questions.
- Researchers identified an 'AI underreliance plateau' where human performance stops improving despite better AI.
Why It Matters
When deploying AI in critical services, the quality of suggestions is paramount: bad AI can be worse than no AI at all.