Agent Frameworks

Social Norm Reasoning in Multimodal Language Models: An Evaluation

A new study tested five MLLMs on 60 social scenarios to see if they understand human norms.

Deep Dive

A new research paper from the University of Otago, titled 'Social Norm Reasoning in Multimodal Language Models: An Evaluation,' provides a crucial benchmark for how well today's leading AI models understand the unwritten rules of human social interaction. The study, to be published in ICAART 2026, tested five major MLLMs—GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, Llama 3.2 11B Vision, and Qwen-2.5VL—on their ability to answer norm-related questions about 60 short stories. This work bridges the gap between the traditional symbolic approaches used in Normative Multi-Agent Systems (NorMAS) and the emerging capabilities of foundation models, evaluating their potential to power more socially aware robots and software agents.
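The paper's evaluation protocol, posing norm-related questions about short scenarios and scoring answers per modality, can be sketched roughly as follows. This is a minimal illustration, not the study's code: the `Scenario` fields, the stub model, and the example stories are hypothetical placeholders, and a real run would replace `stub_model` with calls to each MLLM's API.

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    story: str     # short story describing a social situation (or an image caption)
    modality: str  # "text" or "image"
    question: str  # norm-related question posed to the model
    gold: str      # expected answer label, e.g. "violation" or "compliance"

def evaluate(model, scenarios):
    """Return accuracy per modality for a model over a list of scenarios."""
    correct, total = {}, {}
    for s in scenarios:
        pred = model(s.story, s.question)
        total[s.modality] = total.get(s.modality, 0) + 1
        if pred == s.gold:
            correct[s.modality] = correct.get(s.modality, 0) + 1
    return {m: correct.get(m, 0) / total[m] for m in total}

# Stub standing in for an MLLM API call; always predicts "violation".
def stub_model(story, question):
    return "violation"

scenarios = [
    Scenario("A guest talks loudly in a library.", "text", "Is a norm violated?", "violation"),
    Scenario("A diner thanks the waiter.", "text", "Is a norm violated?", "compliance"),
    Scenario("(image of someone cutting in line)", "image", "Is a norm violated?", "violation"),
]
print(evaluate(stub_model, scenarios))  # → {'text': 0.5, 'image': 1.0}
```

Comparing the per-modality accuracies produced by a harness like this is what surfaces the text-versus-image gap the study reports.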

The evaluation revealed a clear performance hierarchy and a significant modality gap. OpenAI's GPT-4o delivered the most competent norm reasoning across both text and image inputs, positioning it as the most promising candidate for integration into social AI systems. The open-source model Qwen-2.5VL from Alibaba placed second, emerging as a strong, cost-free alternative. A critical finding was that all models performed markedly better on text-based scenarios than on image-based ones, highlighting a major challenge in visual social understanding. Furthermore, the research notes that all tested models found reasoning about 'complex norms' particularly challenging, indicating a key frontier for future model development. This benchmark sets the stage for building AI that can navigate nuanced human contexts, from customer service bots to collaborative robots.

Key Points
  • GPT-4o ranked #1 in social norm reasoning across text and image modalities, ahead of Qwen-2.5VL, Claude 3.5 Sonnet, Gemini 1.5 Pro, and Llama 3.2 11B Vision.
  • All five tested MLLMs showed a significant modality gap, handling text-based scenarios far better than image-based ones.
  • The open-source model Qwen-2.5VL was the top-performing free model, offering a viable alternative for cost-sensitive applications.

Why It Matters

This benchmark is essential for developing AI assistants and robots that can interact safely and appropriately in human social environments.