SOTOPIA-TOM: Evaluating Information Management in Multi-Agent Interaction with Theory of Mind
Even the strongest LLMs falter when juggling information sharing and secrecy.
As LLM‑based agents move into multi‑party settings like customer service or team coordination, they must handle information asymmetry—sharing useful details while protecting privacy. Existing benchmarks ignore these realistic constraints. To fill the gap, researchers from multiple institutions created SOTOPIA‑TOM, a multi‑dimensional framework that tests agents across 160 scenarios spanning 8 industry sectors. Each scenario involves 3–5 agents with partitioned private knowledge and both broadcast and direct‑message channels. The evaluation measures how well agents share relevant information, seek missing data, coordinate efficiently, and avoid privacy leaks, all rolled into a composite INFOMGMT score.
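To make the composite concrete, here is a minimal Python sketch of how the four dimensions might roll up into one INFOMGMT number. The dimension names come from the summary above, but the weights and the weighted-sum aggregation are illustrative assumptions, not the benchmark's published formula.

```python
# Hypothetical sketch of a composite information-management score.
# The four dimensions mirror the behaviors SOTOPIA-TOM evaluates;
# equal weights and a weighted sum are assumptions for illustration.

def infomgmt_score(sharing, seeking, coordination, privacy, weights=None):
    """Aggregate per-dimension scores (each in [0, 1]) into one composite value."""
    dims = {"sharing": sharing, "seeking": seeking,
            "coordination": coordination, "privacy": privacy}
    weights = weights or {d: 0.25 for d in dims}  # equal weighting assumed
    return sum(weights[d] * dims[d] for d in dims)

# Example: an agent that shares and coordinates well but leaks private info
score = infomgmt_score(sharing=0.8, seeking=0.6, coordination=0.7, privacy=0.3)
print(round(score, 2))  # → 0.6
```

Under this kind of aggregation, a single weak dimension (here, privacy) drags the composite down even when the others are strong, which matches the paper's finding that privacy-aware behavior is the bottleneck.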
Across 6 LLM backbones and several prompting strategies, even the most powerful reasoning model (GPT‑5) managed only a 62% INFOMGMT score, exposing persistent weaknesses in information seeking and privacy‑aware decision‑making. Notably, Theory‑of‑Mind (ToM) interventions proved effective: on GPT‑4o, a ToM‑Coaching approach reduced critical privacy violations from 9.9% to 2.2% and boosted the INFOMGMT score from 15% to 40%, a more than 2.5× improvement. SOTOPIA‑TOM is now available as an extensible testbed for building more privacy‑aware, ToM‑capable multi‑agent systems, highlighting a crucial gap in current LLM agent capabilities.
- 160 human-reviewed scenarios across 8 industry sectors with 3–5 agents each, using public and private communication channels
- GPT-5 scored only 62% on the composite INFOMGMT metric, the highest among tested models
- Theory‑of‑Mind interventions reduced GPT‑4o’s critical privacy violations from 9.9% to 2.2% and raised its INFOMGMT score from 15% to 40%
Why It Matters
Exposes critical weaknesses in LLM agents for real-world collaborations where privacy and information sharing must be balanced.