OmniToM benchmark exposes LLM Theory of Mind gaps

arXiv cs.AI May 27, 2026

New benchmark shows LLMs struggle to model the beliefs and mental states of actors in a story

Deep Dive

A team led by Adam Bawatneh of the University of Central Florida, with co-authors from multiple institutions, has introduced OmniToM, a benchmark for evaluating Theory of Mind (ToM) capabilities in large language models. Where traditional approaches assess ToM through end-point question answering, OmniToM requires a model to explicitly lay out the belief structures of every actor in a narrative. These structures are decomposed into belief propositions: minimal statements capturing what an actor believes to be true about the world or about others' mental states.

The benchmark consists of 895 stories drawn from the existing ToMBench corpus, augmented with 22,343 human-calibrated belief propositions labeled along seven dimensions: recursive order, truth status, knowledge access, explicitness, content type, mental source, and context. In zero-shot evaluations across a range of models, OmniToM revealed a critical bottleneck: LLMs struggle to turn narrative facts into accurate actor-specific beliefs and shared mental states, particularly when tracking knowledge access and representational nuances. The result points to a gap in current LLMs' ability to model dynamic, divergent, or mistaken beliefs, a core aspect of human social cognition.
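To make the idea of a labeled belief proposition concrete, here is a minimal Python sketch of how one record might be represented. Only the seven dimension names come from the benchmark description; the `actor` and `statement` fields, the label values (such as "false" or "not_witnessed"), the helper functions, and the Sally-Anne style story are all illustrative assumptions, not OmniToM's actual schema or data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class BeliefProposition:
    """One belief held by one actor (hypothetical schema, not OmniToM's own)."""
    actor: str             # assumption: who holds the belief
    statement: str         # assumption: the minimal belief statement
    # The seven dimension names below are from the article;
    # the value vocabularies are invented for illustration.
    recursive_order: int   # 1 = belief about the world, 2 = belief about a belief
    truth_status: str
    knowledge_access: str
    explicitness: str
    content_type: str
    mental_source: str
    context: str


def group_by_actor(props):
    """Collect propositions per actor, i.e. an explicit belief model of the story."""
    grouped = defaultdict(list)
    for prop in props:
        grouped[prop.actor].append(prop)
    return dict(grouped)


def false_beliefs(props):
    """Return the propositions whose (assumed) truth label is 'false'."""
    return [prop for prop in props if prop.truth_status == "false"]


# Illustrative story: Sally puts a marble in the basket and leaves;
# Anne moves it to the box while Sally is away.
props = [
    BeliefProposition("Sally", "The marble is in the basket", 1, "false",
                      "not_witnessed", "implicit", "location", "perception", "story"),
    BeliefProposition("Anne", "The marble is in the box", 1, "true",
                      "witnessed", "explicit", "location", "perception", "story"),
    BeliefProposition("Anne", "Sally believes the marble is in the basket", 2, "true",
                      "inferred", "implicit", "belief", "inference", "story"),
]

for actor, beliefs in group_by_actor(props).items():
    print(actor, [b.statement for b in beliefs])
print("False beliefs:", [b.actor for b in false_beliefs(props)])
```

An end-point question such as "Where will Sally look for the marble?" tests only the first record; representing all three makes the divergence between Sally's and Anne's beliefs explicit, which is the kind of structure the benchmark is described as requiring.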

Key Points

OmniToM evaluates Theory of Mind in LLMs by requiring explicit belief modeling, not just end-point answers
Benchmark comprises 895 stories with 22,343 labeled belief propositions across seven dimensions
Zero-shot tests show LLMs fail to accurately track actor-specific beliefs and knowledge access

Why It Matters

The results highlight limitations in LLMs' social reasoning that affect applications such as chatbots, customer service, and mental health tools.
