Do Large Language Models Possess a Theory of Mind? A Comparative Evaluation Using the Strange Stories Paradigm
OpenAI's latest model demonstrates robust social reasoning, performing comparably to humans on complex mental-state attribution tasks.
A research team from Eötvös Loránd University conducted a comparative evaluation of five Large Language Models (LLMs) using the Strange Stories paradigm, a text-based tool widely used in human Theory of Mind (ToM) research. The study aimed to determine whether LLMs can genuinely infer others' beliefs, intentions, and emotions from text, or whether their outputs merely reflect superficial pattern completion. The models were tested on their ability to answer questions about story characters' mental states, with results revealing a significant performance gap between earlier and smaller models and the most advanced system.
GPT-4o demonstrated exceptional performance, achieving high accuracy and strong robustness comparable to human controls, even when presented with challenging conditions containing distracting information. In contrast, earlier and smaller models were strongly affected by the number of relevant inferential cues available and showed vulnerability to irrelevant information in the texts. The research contributes to ongoing philosophical and technical debates about the cognitive status of LLMs and the boundary between genuine understanding and statistical approximation in artificial intelligence systems.
The findings suggest that while current state-of-the-art models like GPT-4o can produce outputs indistinguishable from human ToM reasoning in specific text-based contexts, questions remain about whether this represents true mental-state attribution or sophisticated pattern matching. The study's methodology provides a framework for future evaluations of social-cognitive capabilities in AI systems, with implications for developing more socially intelligent agents and understanding the limitations of language-only training for developing human-like reasoning.
- GPT-4o performed comparably to humans on the Strange Stories Theory of Mind test, showing high accuracy and robustness
- Earlier and smaller LLMs showed significant performance gaps and were vulnerable to distracting information in texts
- The study tested five LLMs using a text-based paradigm widely used in human ToM research, adapted for AI evaluation
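The evaluation procedure described above (present a short vignette, ask a mental-state question, and grade the free-text answer) can be sketched in miniature. Note the vignette, question, scoring keywords, and the `score_answer`/`evaluate` helpers below are all invented for illustration; they are not items or code from the actual Strange Stories battery, which uses human-rated rubrics rather than keyword matching.

```python
# Hypothetical sketch of a Strange Stories-style evaluation loop.
# The story, question, and keywords are illustrative inventions,
# NOT items from the real test; keyword matching is a crude
# stand-in for the human rubric scoring used in such studies.

from dataclasses import dataclass

@dataclass
class StoryItem:
    story: str        # vignette presented to the model
    question: str     # mental-state question about a character
    keywords: set     # concepts a correct answer should mention

ITEMS = [
    StoryItem(
        story=("Anna's friend shows her a drawing that Anna finds ugly. "
               "Anna says: 'What a lovely picture!'"),
        question="Why did Anna say that?",
        keywords={"white lie", "polite", "feelings"},
    ),
]

def score_answer(answer: str, keywords: set) -> int:
    """Score 1 if the answer mentions any expected mental-state
    concept, else 0."""
    text = answer.lower()
    return int(any(k in text for k in keywords))

def evaluate(model_answers: list) -> float:
    """Mean accuracy of one model's answers over all items."""
    scores = [score_answer(a, item.keywords)
              for a, item in zip(model_answers, ITEMS)]
    return sum(scores) / len(scores)

# Example: scoring a hypothetical model response.
print(evaluate(["She told a white lie to spare her friend's feelings."]))
# → 1.0
```

A real harness would collect answers from each model under comparison (and from human controls) and compare mean scores across conditions, including the distractor-laden variants the study describes.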
Why It Matters
Advances in social reasoning could enable more natural human-AI interaction and better assistive technologies for social cognition.