EmoOmni: Bridging Emotional Understanding and Expression in Omni-Modal LLMs
The 7B-parameter model matches the performance of 30B-parameter competitors by explicitly modeling emotional reasoning.
A research team led by Wenjie Tian has introduced EmoOmni, a framework designed to close the emotional-intelligence gap in current omni-modal large language models (Omni-LLMs). While existing models such as GPT-4o and Claude 3.5 can process audio and video, they often produce contextually mismatched emotional responses because their 'Thinker-Talker' architectures pass emotional detail through hidden-state connections, where it is easily lost. EmoOmni addresses this by creating an explicit emotional reasoning pathway.
The core innovation is the Emotional Chain-of-Thought (E-CoT), which forces the model to reason from fine-grained multimodal perception (tone of voice, facial expression) to appropriate textual and speech responses, treating that reasoning as high-level instructions for the 'talker' component. The team also built EmoOmniPipe, a pipeline for annotating real-world dialogue data, and established the EmoOmniEval benchmark. Remarkably, their 7-billion-parameter EmoOmni-7B matches the much larger 30-billion-parameter Qwen3Omni-30B-A3B-Thinking model when both use the same talker module, a more than 4x gain in parameter efficiency alongside superior emotional coherence.
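To make the E-CoT idea concrete, here is a minimal sketch in plain Python. It is not the paper's actual implementation: all class names, fields, and the toy cue-to-emotion mapping are illustrative assumptions. The point it shows is the architectural contrast the article describes: the thinker emits an explicit, structured reasoning trace (perceived emotion, rationale, response style) that the talker consumes as high-level instructions, instead of passing opaque hidden states where emotional detail can be lost.

```python
# Hypothetical sketch of an E-CoT-style Thinker-Talker hand-off.
# Names and the rule-based "thinker" are illustrative, not the paper's API.
from dataclasses import dataclass


@dataclass
class Perception:
    transcript: str          # what the user said
    vocal_tone: str          # e.g. "shaky", "flat", "bright"
    facial_expression: str   # e.g. "downcast", "smiling"


@dataclass
class EmotionalCoT:
    perceived_emotion: str   # emotion inferred from multimodal cues
    rationale: str           # explicit reasoning linking cues to emotion
    response_style: str      # high-level delivery instruction for the talker


def thinker(p: Perception) -> EmotionalCoT:
    """Toy 'thinker': maps multimodal cues to an explicit reasoning trace."""
    if p.vocal_tone == "shaky" and p.facial_expression == "downcast":
        return EmotionalCoT(
            perceived_emotion="sadness",
            rationale="shaky voice plus downcast expression suggest distress",
            response_style="gentle, validating, slow-paced speech",
        )
    return EmotionalCoT(
        perceived_emotion="neutral",
        rationale="no strong affective cues detected",
        response_style="friendly, even-toned speech",
    )


def talker(text: str, cot: EmotionalCoT) -> str:
    """Toy 'talker': conditions delivery on the explicit E-CoT instructions,
    so the emotional intent is carried in plain text rather than hidden states."""
    return f"[{cot.response_style}] {text}"


p = Perception("I failed the exam again.", "shaky", "downcast")
cot = thinker(p)
reply = talker("That sounds really hard. Want to talk it through?", cot)
```

Because the hand-off is an explicit data structure, it can also double as supervision: an annotation pipeline like EmoOmniPipe could fill in these fields from real dialogues to train the thinker end to end.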
- Introduces Emotional Chain-of-Thought (E-CoT) reasoning that explicitly connects multimodal perception to emotional expression
- EmoOmni-7B achieves performance comparable to Qwen3Omni-30B-A3B-Thinking, a more than 4x gain in parameter efficiency
- Includes new EmoOmniPipe dataset pipeline and EmoOmniEval benchmark for systematic assessment of emotional dialogue AI
Why It Matters
Enables AI assistants and chatbots to deliver more natural, context-aware emotional responses across voice and video interactions.