ML-SAN: Multi-Level Speaker-Adaptive Network for Emotion Recognition in Conversations
ML-SAN uses a three-stage adaptive process to account for individual differences in emotional expression...
Researchers from Xinjiang University have introduced ML-SAN (Multi-Level Speaker-Adaptive Network), a novel approach to emotion recognition in conversations that addresses a critical flaw in current systems: the inability to account for individual differences in emotional expression. Traditional models treat all speakers identically, leading to confusion when different people express the same emotion differently; some show happiness through facial expressions, others through words or actions. ML-SAN tackles this with a three-stage adaptive process: Input-level Calibration uses Feature-wise Linear Modulation (FiLM) to map raw audio and visual features into a speaker-neutral space; Interaction-level Gating reweights how much each modality is trusted based on speaker identity; and Output-level Regularization enforces consistency of speaker representations in the latent space.
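The first stage, input-level calibration with FiLM, can be sketched as follows. FiLM conditions a feature vector on a context vector by predicting a per-dimension scale (gamma) and shift (beta). This is a minimal illustrative sketch, not the paper's implementation: the function names, the toy speaker embedding, and the tiny weight matrices are all assumptions.

```python
# Sketch of input-level calibration via FiLM (Feature-wise Linear Modulation).
# All names and weights below are illustrative, not taken from the ML-SAN paper.

def linear(x, weights, bias):
    """Plain affine map: one output per row of `weights`."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def film_calibrate(features, speaker_emb, w_gamma, b_gamma, w_beta, b_beta):
    """Modulate a raw modality feature toward a speaker-neutral space:
    out = gamma(speaker) * f + beta(speaker), element-wise."""
    gamma = linear(speaker_emb, w_gamma, b_gamma)
    beta = linear(speaker_emb, w_beta, b_beta)
    return [g * f + b for g, f, b in zip(gamma, features, beta)]

# Toy example: 2-dim speaker embedding, 3-dim audio feature.
speaker_emb = [1.0, -0.5]
audio_feat = [0.2, 0.8, -0.1]
w_gamma = [[0.5, 0.0], [0.0, 0.5], [0.5, 0.5]]   # 3x2 scale predictor
b_gamma = [1.0, 1.0, 1.0]                        # biased near identity scaling
w_beta = [[0.1, 0.0], [0.0, 0.1], [0.1, 0.1]]    # 3x2 shift predictor
b_beta = [0.0, 0.0, 0.0]

calibrated = film_calibrate(audio_feat, speaker_emb,
                            w_gamma, b_gamma, w_beta, b_beta)
print(calibrated)  # [0.4, 0.55, -0.075]
```

In practice gamma and beta would come from learned layers conditioned on a speaker embedding, so the same raw feature is rescaled differently for an expressive speaker than for a reserved one.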
Tested on the MELD and IEMOCAP datasets, ML-SAN achieves better results than existing models, especially for challenging tail sentiment categories (rare emotions) and for speaker diversity in multi-turn dialogues. The paper, accepted at the International Conference on Intelligent Computing 2026, positions ML-SAN as a shift from static emotion recognition to a dynamic, speaker-adaptive approach. This could significantly improve human-machine empathy in applications such as customer service, mental health chatbots, and virtual assistants, making them more responsive to individual emotional nuances.
- ML-SAN uses a three-stage adaptive process: Input-level Calibration (FiLM), Interaction-level Gating, and Output-level Regularization
- Outperforms existing models on MELD and IEMOCAP datasets, especially for rare tail sentiment categories
- Addresses the problem of speaker identity information confusion in multi-turn dialogues
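The second stage, interaction-level gating, can be pictured as a speaker-conditioned softmax gate over modalities: each modality gets a trust score from the speaker embedding, and the fused representation is the gate-weighted sum of modality features. This is a hedged sketch under assumed shapes; the scoring weights and fusion rule are illustrative, not the paper's architecture.

```python
# Sketch of interaction-level gating: a speaker-conditioned softmax gate
# decides how much each modality (text/audio/visual) is trusted.
# All weights and embeddings below are illustrative assumptions.
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def speaker_gated_fusion(modality_feats, speaker_emb, score_weights):
    """Score each modality from the speaker embedding, softmax the scores,
    and return the gate-weighted sum of modality features plus the gates."""
    scores = [sum(w * s for w, s in zip(row, speaker_emb))
              for row in score_weights]
    gates = softmax(scores)
    dim = len(modality_feats[0])
    fused = [sum(g * feats[d] for g, feats in zip(gates, modality_feats))
             for d in range(dim)]
    return fused, gates

# Toy example: a speaker whose visual channel is most informative.
text_f, audio_f, visual_f = [0.1, 0.9], [0.4, 0.2], [0.8, 0.5]
speaker_emb = [1.0, 0.0]
score_weights = [[0.0, 1.0],   # text scorer
                 [0.0, 1.0],   # audio scorer
                 [2.0, 0.0]]   # visual scorer
fused, gates = speaker_gated_fusion([text_f, audio_f, visual_f],
                                    speaker_emb, score_weights)
print(gates)  # visual gate dominates for this speaker
```

The third stage, output-level regularization, would then add a training loss that keeps representations of the same speaker close in the latent space, so the gates and calibration remain consistent across a multi-turn dialogue.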
Why It Matters
ML-SAN enables more empathetic AI by adapting to individual expressive traits, improving emotion recognition in real-world conversations.