AI Safety

HumanStudy-Bench: AI agent human-likeness tested via validated social science experiments

arXiv cs.CY May 18, 2026

⚡Decades of replicated behavioral hypotheses become a rigorous test for LLM-based agents.

Deep Dive

A new paper from researchers including Xuan Liu and HaoYang Shang introduces a novel framework for evaluating how human-like LLM-based agents truly are. Their key insight: if an agent is human-like, a population of such agents should reach the same inferential conclusions as humans when run through identical experiments. To operationalize this, they built HumanStudy-Bench, an open platform that converts published human-subject studies with validated, independently replicated hypotheses into reusable simulation environments. The platform scores agent-human alignment using two metrics: Probability Alignment Score (PAS), measuring inferential agreement, and Effect Consistency Score (ECS), measuring effect-size agreement.
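The summary does not give the formulas behind PAS or ECS, so the following is only a minimal sketch of the underlying idea, not the paper's method. It assumes a simple two-condition study, a large-sample z-test for the inferential conclusion, and Cohen's d for the effect size. The helper names (`conclusion`, `inferential_agreement`, `effect_size_gap`) and the toy data are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def cohens_d(treatment, control):
    """Standardized mean difference using the pooled standard deviation."""
    n1, n2 = len(treatment), len(control)
    pooled_var = ((n1 - 1) * stdev(treatment) ** 2
                  + (n2 - 1) * stdev(control) ** 2) / (n1 + n2 - 2)
    return (mean(treatment) - mean(control)) / math.sqrt(pooled_var)


def two_sided_p(treatment, control):
    """Two-sided p-value from a large-sample z-test (normal approximation)."""
    se = math.sqrt(stdev(treatment) ** 2 / len(treatment)
                   + stdev(control) ** 2 / len(control))
    z = (mean(treatment) - mean(control)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def conclusion(treatment, control, alpha=0.05):
    """+1 or -1 for a significant effect in that direction, 0 otherwise."""
    if two_sided_p(treatment, control) >= alpha:
        return 0
    return 1 if mean(treatment) > mean(control) else -1


def inferential_agreement(human, agent, alpha=0.05):
    """True when both samples support the same conclusion (toy stand-in, not PAS)."""
    return conclusion(*human, alpha=alpha) == conclusion(*agent, alpha=alpha)


def effect_size_gap(human, agent):
    """Absolute difference in Cohen's d (toy stand-in, not ECS)."""
    return abs(cohens_d(*human) - cohens_d(*agent))


# Hypothetical data: each entry is (treatment responses, control responses).
human = ([2, 3, 4, 5, 6] * 20, [1, 2, 3, 4, 5] * 20)
replicating_agents = ([4, 5, 6, 7, 8] * 20, [3, 4, 5, 6, 7] * 20)
failing_agents = ([1, 2, 3, 4, 5] * 20, [1, 2, 3, 4, 5] * 20)

print(inferential_agreement(human, replicating_agents))  # same conclusion as humans
print(inferential_agreement(human, failing_agents))      # agents show no effect
print(round(effect_size_gap(human, failing_agents), 2))
```

Separating the two checks reflects the distinction drawn above: a population of agents can reach the same significant/non-significant conclusion as humans while still showing an effect of a different magnitude.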

The team curated an initial benchmark of 12 studies with robustly established findings from decades of social science research, then evaluated 10 models under 4 agent designs. Agent outcomes polarized sharply: agents tended either to replicate the human result fully or to fail completely. Agent design (prompting strategy, reasoning approach) influenced human-likeness more than model scale (the size of the LLM), but the effect was non-monotonic: a more elaborate design did not reliably yield better alignment. The work offers a more objective, decomposable, and scalable methodology for measuring how closely AI agents reproduce human behavior.

Key Points

HumanStudy-Bench leverages 12 independently replicated behavioral hypotheses from social science as test scenarios.
Two metrics score alignment: PAS (inferential agreement) and ECS (effect-size consistency) across agent populations.
Agent design influences human-likeness more than model scale, but the relationship is non-monotonic: better prompts don't guarantee better alignment.

Why It Matters

Offers a rigorous, scalable benchmark for evaluating whether AI agents genuinely mimic human decision-making.
