Research & Papers

Omni-MMSI: Toward Identity-attributed Social Interaction Understanding

New pipeline tackles the 'cocktail party problem' for AI, attributing speech and actions to specific people.

Deep Dive

A research team led by Xinpeng Li of Georgia Tech, with collaborators from UC Irvine and other institutions, has introduced a new task for AI: Omni-MMSI (Omni-modal Multi-person Social Interaction Understanding). The paper has been accepted to CVPR 2026. The task requires AI systems to take raw audio, visual, and speech data from multi-person scenarios and do two things: perceive identity-attributed social cues (e.g., determining *who* is saying *what*) and reason about the social interaction itself (e.g., working out *whom* a speaker is referring to). This goes beyond prior work that relied on pre-processed 'oracle' data, and it forces models to handle the messy conditions that real AI assistants face.

To address the task, the team proposes Omni-MMSI-R, a reference-guided pipeline. It first uses specialized tools to produce identity-attributed social cues from the raw data, then applies chain-of-thought reasoning to interpret the social dynamics. To train and evaluate it, the researchers constructed participant-level reference pairs and curated new reasoning annotations on top of existing datasets. In experiments, Omni-MMSI-R significantly outperformed advanced general-purpose LLMs and other multi-modal counterparts that lack robust identity attribution. That shortfall in existing models is the core technical gap the research addresses.
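
The two-stage idea (attribute cues to identities first, then reason over them) can be illustrated with a toy sketch. This is not the authors' implementation: the names `Cue`, `attribute_identity`, `build_cues`, and `build_reasoning_prompt` are hypothetical, the two-dimensional "embeddings" are made-up stand-ins for whatever features the real tools extract, and nearest-reference matching by cosine similarity is an assumed substitute for the paper's reference-guided attribution.

```python
import math
from dataclasses import dataclass


@dataclass
class Cue:
    """One identity-attributed social cue: who said what."""
    speaker: str
    text: str


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def attribute_identity(embedding, references):
    """Return the participant whose reference embedding is most similar.

    `references` maps a participant name to a reference vector
    (a hypothetical stand-in for a participant-level reference).
    """
    return max(references, key=lambda name: cosine(embedding, references[name]))


def build_cues(segments, references):
    """Stage 1: turn (embedding, transcript) segments into attributed cues."""
    return [Cue(attribute_identity(emb, references), text) for emb, text in segments]


def build_reasoning_prompt(cues, question):
    """Stage 2: format the attributed cues into a step-by-step reasoning prompt."""
    transcript = "\n".join(f"{c.speaker}: {c.text}" for c in cues)
    return (
        "Transcript with speaker identities:\n"
        f"{transcript}\n\n"
        f"Question: {question}\n"
        "Think step by step about who is addressing whom, then answer."
    )


# Toy data: two participants and two speech segments (all values invented).
references = {"Alice": [1.0, 0.0], "Bob": [0.0, 1.0]}
segments = [
    ([0.9, 0.1], "I think she is bluffing."),
    ([0.2, 0.8], "Who, me?"),
]

cues = build_cues(segments, references)
prompt = build_reasoning_prompt(cues, "Whom does 'she' refer to?")
print(prompt)
```

The resulting prompt would then be passed to a language model of one's choice. The point of the sketch is only the ordering: the reasoning step sees cues that already carry identities, instead of having to infer them from raw input.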

Key Points
  • Defines the new Omni-MMSI task for AI to understand who-said-what-to-whom from raw multi-modal data.
  • Proposes the Omni-MMSI-R pipeline, which combines tool-based identity attribution with chain-of-thought reasoning and outperforms advanced LLMs.
  • Accepted to top-tier conference CVPR 2026, targeting a core limitation for future conversational AI and assistants.

Why It Matters

The work could help AI assistants follow complex group conversations, a critical step beyond today's one-on-one chat interfaces.