Agent Frameworks

Social Norm Reasoning in Multimodal Language Models: An Evaluation

A new study tested five MLLMs on 60 social scenarios to see if they understand human norms.

Deep Dive

A new research paper from the University of Otago, titled 'Social Norm Reasoning in Multimodal Language Models: An Evaluation,' provides a crucial benchmark for how well today's leading AI models understand the unwritten rules of human social interaction. The study, to be published in ICAART 2026, tested five major MLLMs—GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, Llama 3.2 11B Vision, and Qwen-2.5VL—on their ability to answer norm-related questions about 60 short stories. This work bridges the gap between the traditional symbolic approaches used in Normative Multi-Agent Systems (NorMAS) and the emerging capabilities of foundation models, evaluating their potential to power more socially aware robots and software agents.
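The paper's evaluation protocol, posing norm-related questions about short scenarios and scoring answers per modality, can be sketched roughly as follows. This is a minimal illustration, not the study's code: the `Scenario` fields, the stub model, and the example stories are hypothetical placeholders, and a real run would replace `stub_model` with calls to each MLLM's API.

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    story: str     # short story describing a social situation (or an image caption)
    modality: str  # "text" or "image"
    question: str  # norm-related question posed to the model
    gold: str      # expected answer label, e.g. "violation" or "compliance"

def evaluate(model, scenarios):
    """Return accuracy per modality for a model over a list of scenarios."""
    correct, total = {}, {}
    for s in scenarios:
        pred = model(s.story, s.question)
        total[s.modality] = total.get(s.modality, 0) + 1
        if pred == s.gold:
            correct[s.modality] = correct.get(s.modality, 0) + 1
    return {m: correct.get(m, 0) / total[m] for m in total}

# Stub standing in for an MLLM API call; always predicts "violation".
def stub_model(story, question):
    return "violation"

scenarios = [
    Scenario("A guest talks loudly in a library.", "text", "Is a norm violated?", "violation"),
    Scenario("A diner thanks the waiter.", "text", "Is a norm violated?", "compliance"),
    Scenario("(image of someone cutting in line)", "image", "Is a norm violated?", "violation"),
]
print(evaluate(stub_model, scenarios))  # → {'text': 0.5, 'image': 1.0}
```

Comparing the per-modality accuracies produced by a harness like this is what surfaces the text-versus-image gap the study reports.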

The evaluation revealed a clear performance hierarchy and a significant modality gap. OpenAI's GPT-4o delivered the most competent norm reasoning across both text and image inputs, positioning it as the most promising candidate for integration into social AI systems. The open-source model Qwen-2.5VL from Alibaba placed second, emerging as a strong, cost-free alternative. A critical finding was that all models performed markedly better on text-based scenarios than on image-based ones, highlighting a major challenge in visual social understanding. Furthermore, the research notes that all tested models found reasoning about 'complex norms' particularly challenging, indicating a key frontier for future model development. This benchmark sets the stage for building AI that can navigate nuanced human contexts, from customer service bots to collaborative robots.

Key Points
  • GPT-4o ranked #1 in social norm reasoning across text and image modalities, ahead of Qwen-2.5VL, Claude 3.5 Sonnet, Gemini 1.5 Pro, and Llama 3.2 11B Vision.
  • All five tested MLLMs showed a significant modality gap, handling text-based scenarios far better than image-based ones.
  • The open-source model Qwen-2.5VL was the top-performing free model, offering a viable alternative for cost-sensitive applications.

Why It Matters

This benchmark is essential for developing AI assistants and robots that can interact safely and appropriately in human social environments.