LLM Digital Twins Show High Aggregate Accuracy but Under-Reproduce Heuristic Biases
Digital twins matched aggregate human responses but missed item-level nuances and decision biases.
A new arXiv paper by Yufei Zhang and Zhihao Ma systematically evaluates the psychometric comparability of LLM-based digital twins, AI models designed to mimic human respondents. The authors propose a construct validity framework covering both construct representation and nomothetic span, and benchmark the twins against human gold standards. Digital twins achieved high aggregate-level accuracy and strong profile correlations across multiple tasks, but their item-level correlations were consistently attenuated, revealing a gap between group averages and individual response patterns. In word association tests, LLM networks exhibited humanlike small-world structure and theory-consistent communities, but diverged from human networks lexically and in local structural details.
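The gap between aggregate accuracy and item-level correlation can be illustrated with a small simulation. This is a minimal sketch on synthetic data, not the paper's data or method. It assumes that "item-level correlation" means the per-item Pearson correlation, across respondents, between each person's answer and their twin's answer. The function names and the constants 0.3 and 0.6 are hypothetical choices made only for illustration.

```python
import random
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(n_people=500, seed=0):
    """Return (humans, twins): one list of item responses per person.

    Synthetic data only. The 0.3 and 0.6 constants are arbitrary
    assumptions, not estimates from the paper.
    """
    rng = random.Random(seed)
    item_means = [2.0, 2.5, 3.0, 3.5, 4.0]  # hypothetical item means
    humans, twins = [], []
    for _ in range(n_people):
        human = [m + rng.gauss(0, 1.0) for m in item_means]
        # The twin recovers each item mean but only 30% of the person's
        # own deviation from it, plus unrelated noise.
        twin = [m + 0.3 * (h - m) + rng.gauss(0, 0.6)
                for m, h in zip(item_means, human)]
        humans.append(human)
        twins.append(twin)
    return humans, twins


def aggregate_r(humans, twins):
    """Correlate per-item means across items (group-level agreement)."""
    human_means = [mean(col) for col in zip(*humans)]
    twin_means = [mean(col) for col in zip(*twins)]
    return pearson(human_means, twin_means)


def item_level_rs(humans, twins):
    """For each item, correlate human and twin answers across people."""
    return [pearson(h_col, t_col)
            for h_col, t_col in zip(zip(*humans), zip(*twins))]


humans, twins = simulate()
print("aggregate r:", round(aggregate_r(humans, twins), 3))
print("mean item-level r:", round(mean(item_level_rs(humans, twins)), 3))
```

Under these assumptions the aggregate correlation should come out close to 1, while the per-item correlations are noticeably lower. A twin can track group means well and still capture only part of what distinguishes one respondent from another.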
When tested on decision-making and contextualized tasks, the digital twins under-reproduced well-known heuristic biases, instead showing normative rationality, compressed variance, and limited sensitivity to temporal context. Feature-rich and trait-relevant conditioning improved Big Five personality predictions and nomothetic-span alignment, yet network invariance remained limited, with only partial configural solutions and persistent loading differences. In cross-language free-text tasks (English and Chinese), feature-rich digital twins better approximated construct-level narrative content, but linguistic and idiographic differences persisted. The study concludes that digital twins are most reliable when confined to validated contexts in which the construct, the task, and the level of inference are aligned with human data.
- Digital twins achieved high aggregate accuracy but had attenuated item-level correlations, missing individual nuances.
- In decision-making tasks, LLMs under-reproduced heuristic biases, showing normative rationality and compressed variance.
- Feature-rich conditioning improved Big Five predictions, but network invariance remained limited, with only partial configural solutions and persistent loading differences.
Why It Matters
The study sets clear boundaries for using LLMs as human proxies: they are useful for aggregate estimates, but not for individual-level nuance or for reproducing human decision biases.