Research & Papers

GroupToM-Bench reveals multimodal LLMs can't grasp group dynamics

Even top multimodal models struggle to predict group outcomes from individual beliefs

Deep Dive

Humans navigate group dynamics with little effort: reading a room, predicting mob behavior, or sensing when a team will fracture. Can multimodal large language models (MLLMs) do the same? A new paper from researchers including Weidong Tang and Yang You, accepted at ACL 2026, introduces GroupToM-Bench, the first benchmark specifically designed to test group-level Theory of Mind (ToM). Unlike individual ToM benchmarks, it probes a causal chain spanning three levels: micro-level mental states (beliefs, desires, intentions), meso-level group tension and structural constraints, and macro-level outcome prediction. The benchmark includes a seven-level cognitive audit framework to assess how well models infer non-linear social emergence, meaning phenomena such as conformity, social tension, and collective decision-making that cannot be derived by simply summing individual intentions.
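To make "cannot be derived by simply summing individual intentions" concrete, here is a minimal sketch based on Granovetter's classic threshold model of collective behavior. It is not taken from the paper or the benchmark. The function name `cascade_size` and both toy populations are hypothetical, chosen only for illustration. Each agent joins a collective action once the number of people already participating reaches that agent's personal threshold.

```python
def cascade_size(thresholds):
    """Return how many agents end up participating.

    Illustrative Granovetter-style rule (an assumption, not the paper's method):
    an agent joins once the number of current participants is at least
    its own threshold.
    """
    active = 0
    while True:
        # Count agents whose threshold is met by the current participation level.
        joined = sum(1 for t in thresholds if t <= active)
        if joined == active:  # nobody new joined, so the group has settled
            return active
        active = joined


# Population A: thresholds 0, 1, 2, ..., 99. Each joiner tips the next agent.
population_a = list(range(100))

# Population B: identical, except the agent with threshold 1 now has threshold 2.
population_b = [0, 2] + list(range(2, 100))

# The summed individual dispositions are almost the same (4950 vs 4951) ...
print(sum(population_a), sum(population_b))

# ... but the group-level outcomes are opposite.
print(cascade_size(population_a))  # 100: everyone joins
print(cascade_size(population_b))  # 1: the cascade stalls after the first agent
```

Changing a single agent's disposition by one unit flips the outcome from full participation to almost none, even though the aggregate of individual dispositions barely moves. That is the kind of non-linear jump from micro-level states to macro-level outcomes the benchmark is said to probe.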

Experiments across current state-of-the-art MLLMs reveal a stark gap: models fail to capture the non-linear dynamics that humans grasp intuitively. For instance, they struggle to predict when conflicting individual desires will escalate into group polarization, or when structural constraints (e.g., power hierarchies) will override individual preferences. The authors argue that true general intelligence requires a social world model, not just a physical one, and GroupToM-Bench gives researchers a way to measure progress on that dimension. The findings suggest that even advanced multimodal AI still lacks the social cognition needed for real-world applications such as team coordination, negotiation, and crowd management.

Key Points
  • Seven-level cognitive audit framework tests everything from individual belief-desire-intention (BDI) states to collective outcome prediction
  • Micro-to-macro causal chain models how beliefs, desires, and intentions interact with group tension and structural constraints
  • Human baselines significantly outperform all tested MLLMs, exposing a critical gap in social emergence reasoning

Why It Matters

The results suggest that even advanced multimodal models still lack the social reasoning needed for real-world group coordination.