OmniToM benchmark exposes LLM Theory of Mind gaps
New benchmark reveals LLMs fail to model human-like beliefs and mental states
A team led by Adam Bawatneh of the University of Central Florida, with co-authors from multiple institutions, has introduced OmniToM, a benchmark for evaluating Theory of Mind (ToM) capabilities in large language models. Where traditional approaches assess ToM through end-point question answering, OmniToM requires explicit modeling of the belief structures of every actor in a narrative. These structures are decomposed into belief propositions: minimal statements capturing what an actor believes to be true about the world or about others' mental states.
The benchmark consists of 895 stories from the existing ToMBench corpus, augmented with 22,343 human-calibrated belief propositions labeled across seven dimensions: recursive order, truth status, knowledge access, explicitness, content type, mental source, and context. In zero-shot evaluations across diverse models, OmniToM revealed a critical bottleneck: LLMs struggle to transform narrative facts into accurate actor-specific beliefs and shared mental states, particularly when tracking knowledge access and representational nuances. This points to a gap in current LLMs' ability to model dynamic, divergent, or mistaken beliefs, a core aspect of human social cognition.
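To make the idea of a labeled belief proposition concrete, here is a minimal Python sketch. Only the seven dimension names come from the description above. The field types, the label values (such as "witnessed" or "implicit"), the `BeliefProposition` class, the `false_beliefs` helper, and the Sally-Anne style example story are all illustrative assumptions, not OmniToM's actual schema or data.

```python
from dataclasses import dataclass


# Hypothetical schema: the seven dimension names follow the article,
# but every field type and label value below is an assumption.
@dataclass(frozen=True)
class BeliefProposition:
    holder: str            # actor who holds the belief
    statement: str         # minimal statement of what the actor believes
    recursive_order: int   # 1 = "A believes X", 2 = "A believes B believes X"
    truth_status: str      # assumed values: "true" / "false"
    knowledge_access: str  # assumed values: "witnessed" / "not_witnessed"
    explicitness: str      # assumed values: "explicit" / "implicit"
    content_type: str      # assumed value: "location"
    mental_source: str     # assumed values: "perception" / "inference"
    context: str           # assumed value: "physical"


def false_beliefs(props, holder):
    """Return the propositions held by `holder` that are labeled false."""
    return [p for p in props if p.holder == holder and p.truth_status == "false"]


# Illustrative story: Anne moves the marble from the basket to the box
# while Sally is out of the room.
story_props = [
    BeliefProposition("Sally", "The marble is in the basket", 1, "false",
                      "not_witnessed", "implicit", "location", "perception", "physical"),
    BeliefProposition("Anne", "The marble is in the box", 1, "true",
                      "witnessed", "explicit", "location", "perception", "physical"),
    BeliefProposition("Anne", "Sally believes the marble is in the basket", 2, "true",
                      "witnessed", "implicit", "location", "inference", "physical"),
]

sally_false = false_beliefs(story_props, "Sally")
print([p.statement for p in sally_false])
```

In this toy setup, a model that simply restates narrative facts would assign Sally the belief that the marble is in the box. Getting Sally's proposition right requires tracking that she did not witness the move, which is the kind of knowledge-access tracking the benchmark is described as probing.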
- OmniToM evaluates Theory of Mind in LLMs by requiring explicit belief modeling, not just end-point answers
- Benchmark comprises 895 stories with 22,343 labeled belief propositions across seven dimensions
- Zero-shot tests show LLMs fail to accurately track actor-specific beliefs and knowledge access
Why It Matters
The results highlight limitations in LLMs' social reasoning that affect applications such as chatbots, customer service, and mental health tools.