HumanStudy-Bench: AI agent human-likeness tested via validated social science experiments
Decades of replicated behavioral hypotheses become a rigorous test for LLM-based agents.
A new paper from researchers including Xuan Liu and HaoYang Shang introduces a framework for evaluating how human-like LLM-based agents are. Their key insight: if an agent is human-like, a population of such agents should reach the same inferential conclusions as humans when run through identical experiments. To operationalize this, they built HumanStudy-Bench, an open platform that converts published human-subject studies with validated, independently replicated hypotheses into reusable simulation environments. The platform scores agent-human alignment using two metrics: Probability Alignment Score (PAS), which measures inferential agreement, and Effect Consistency Score (ECS), which measures effect-size agreement.
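To make the two kinds of agreement concrete, here is a minimal Python sketch of the underlying idea. It does not reproduce the paper's PAS or ECS formulas, which this summary does not spell out. The function names, the large-sample z approximation, and the toy ratings are all assumptions chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def cohens_d(treatment, control):
    """Standardized mean difference using a pooled standard deviation."""
    n1, n2 = len(treatment), len(control)
    pooled_var = (
        (n1 - 1) * stdev(treatment) ** 2 + (n2 - 1) * stdev(control) ** 2
    ) / (n1 + n2 - 2)
    return (mean(treatment) - mean(control)) / sqrt(pooled_var)


def two_sided_p(treatment, control):
    """Two-sided p-value from a large-sample z approximation (an assumption)."""
    se = sqrt(
        stdev(treatment) ** 2 / len(treatment)
        + stdev(control) ** 2 / len(control)
    )
    z = (mean(treatment) - mean(control)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def same_conclusion(human_d, agent_treatment, agent_control, alpha=0.05):
    """Inferential agreement: agents show a significant effect in the human direction."""
    d = cohens_d(agent_treatment, agent_control)
    p = two_sided_p(agent_treatment, agent_control)
    return p < alpha and (d > 0) == (human_d > 0)


def effect_gap(human_d, agent_treatment, agent_control):
    """Effect-size agreement: absolute gap between agent and human Cohen's d."""
    return abs(cohens_d(agent_treatment, agent_control) - human_d)


# Hypothetical toy data: ratings from a simulated agent population in two conditions.
agent_treatment = [6, 7, 8, 7, 6, 8, 7, 7]
agent_control = [4, 5, 4, 5, 3, 4, 5, 4]
human_d = 0.5  # hypothetical published human effect size

print(same_conclusion(human_d, agent_treatment, agent_control))
print(effect_gap(human_d, agent_treatment, agent_control))
```

The toy population is far too small for the z approximation to be trustworthy, so the example only illustrates the mechanics. In this toy data the agents reach the same significant conclusion as the hypothetical human study, yet their effect is much larger than the human one. That shows why inferential agreement and effect-size agreement are worth scoring separately.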
The team curated an initial benchmark of 12 studies with robustly established findings from decades of social science research, then evaluated 10 different models under 4 distinct agent designs. Results showed that agent responses sharply polarize: they either fully replicate human behavior or fail completely. Agent design (prompting strategy, reasoning approach) had a stronger influence on human-likeness alignment than model scale (size of the LLM), but the effect was non-monotonic: a more elaborate design does not consistently yield better alignment and can sometimes reduce it. This work provides a more objective, decomposable, and scalable methodology for measuring whether AI agents behave like humans.
- HumanStudy-Bench turns 12 published studies with independently replicated behavioral hypotheses into reusable test scenarios.
- Two metrics score alignment: PAS (inferential agreement) and ECS (effect-size consistency) across agent populations.
- Agent design influences human-likeness more than model scale, but the relationship is non-monotonic: a more elaborate design does not guarantee better alignment.
Why It Matters
Offers a rigorous, scalable benchmark for evaluating whether AI agents genuinely mimic human decision-making.