Audio & Speech

A Mixture-of-Experts Model for Multimodal Emotion Recognition in Conversations

A new MoE framework analyzes speech and text without speaker ID, achieving state-of-the-art emotion detection.

Deep Dive

A team of researchers has introduced MiSTER-E (Mixture of Speech-Text Experts for Recognition of Emotions), a novel framework designed to tackle the complex challenge of Emotion Recognition in Conversations (ERC). The model addresses two core problems: modeling context within a dialogue's flow and fusing information from multiple modalities, specifically speech and text. Unlike many existing systems, MiSTER-E operates without any speaker identity information, focusing purely on the acoustic and linguistic content. It leverages fine-tuned large language models (LLMs) to generate rich embeddings from individual utterances, which are then processed through a convolutional-recurrent network to capture temporal dependencies across the conversation.
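To make the pipeline concrete, here is a minimal sketch of the convolutional-recurrent context stage described above. All dimensions, layer choices, and the class name are illustrative assumptions, not details from the paper; the input is assumed to be one pooled LLM embedding per utterance.

```python
import torch
import torch.nn as nn

class ConversationContextEncoder(nn.Module):
    """Hypothetical sketch: a convolutional-recurrent encoder that turns a
    sequence of per-utterance LLM embeddings into context-aware features.
    Dimensions and layer choices are illustrative, not from the paper."""

    def __init__(self, embed_dim: int = 768, hidden_dim: int = 256):
        super().__init__()
        # 1-D convolution over the utterance axis mixes each utterance
        # with its immediate neighbors (local conversational context).
        self.conv = nn.Conv1d(embed_dim, hidden_dim, kernel_size=3, padding=1)
        # A bidirectional GRU captures longer-range dependencies across turns.
        self.rnn = nn.GRU(hidden_dim, hidden_dim, batch_first=True,
                          bidirectional=True)

    def forward(self, utt_embeds: torch.Tensor) -> torch.Tensor:
        # utt_embeds: (batch, num_utterances, embed_dim), e.g. pooled
        # hidden states from a fine-tuned LLM, one vector per utterance.
        x = self.conv(utt_embeds.transpose(1, 2)).transpose(1, 2)
        x = torch.relu(x)
        out, _ = self.rnn(x)  # (batch, num_utterances, 2 * hidden_dim)
        return out


# Usage: 4 conversations, 12 utterances each, 768-dim LLM embeddings.
encoder = ConversationContextEncoder()
context = encoder(torch.randn(4, 12, 768))
print(context.shape)  # torch.Size([4, 12, 512])
```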

The technical architecture employs a Mixture-of-Experts (MoE) approach with three specialized 'experts': one for speech-only analysis, one for text-only analysis, and one for cross-modal fusion. A learned gating network dynamically weights the predictions from these experts for each input. To improve performance, the team introduced a supervised contrastive loss to align speech and text representations, along with KL-divergence regularization to encourage consistency among expert outputs. Evaluated on the IEMOCAP, MELD, and MOSI benchmarks, MiSTER-E achieved weighted F1-scores of 70.9%, 69.5%, and 87.9% respectively, outperforming prior multimodal baselines. This advancement points toward more nuanced and privacy-conscious AI for analyzing human interactions in telehealth, customer experience, and conversational AI.
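The three-expert design with a per-input gate can be sketched as follows. This is a plausible reading of the description above, not the authors' implementation; the class name, layer sizes, and the six-class output are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_EMOTIONS = 6  # illustrative; ERC benchmarks vary in label count

class ThreeExpertMoE(nn.Module):
    """Hypothetical sketch of the three-expert design: a speech-only
    expert, a text-only expert, a cross-modal fusion expert, and a
    gating network that weights their predictions per input."""

    def __init__(self, speech_dim: int = 512, text_dim: int = 512):
        super().__init__()
        self.speech_expert = nn.Linear(speech_dim, NUM_EMOTIONS)
        self.text_expert = nn.Linear(text_dim, NUM_EMOTIONS)
        self.fusion_expert = nn.Sequential(
            nn.Linear(speech_dim + text_dim, 256), nn.ReLU(),
            nn.Linear(256, NUM_EMOTIONS))
        # The gate sees both modalities and emits one weight per expert.
        self.gate = nn.Linear(speech_dim + text_dim, 3)

    def forward(self, speech: torch.Tensor, text: torch.Tensor):
        fused = torch.cat([speech, text], dim=-1)
        logits = torch.stack([
            self.speech_expert(speech),
            self.text_expert(text),
            self.fusion_expert(fused),
        ], dim=1)                                      # (batch, 3, classes)
        weights = F.softmax(self.gate(fused), dim=-1)  # (batch, 3)
        # Weighted mixture of expert predictions, one weight set per input.
        mixed = (weights.unsqueeze(-1) * logits).sum(dim=1)
        return mixed, logits, weights
```

Because the gate conditions only on the utterance's speech and text features, the whole pipeline stays speaker-agnostic: no speaker embedding or identity label enters the model.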

Key Points
  • Achieves 70.9% weighted F1-score on the IEMOCAP benchmark, outperforming previous speech-text ERC systems.
  • Uses a Mixture-of-Experts (MoE) framework with three specialized models and a dynamic gating mechanism, without requiring speaker identity.
  • Incorporates a supervised contrastive loss to align speech and text representations, plus KL-divergence regularization to keep the experts' predictions consistent (sketched in code after this list).
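
The two auxiliary losses could look roughly like this. Both functions are simplified assumptions: the contrastive term treats same-label speech/text pairs as positives, and the KL term is one plausible reading of "consistency among expert outputs" (each expert pulled toward the experts' average distribution).

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_alignment(speech_z, text_z, labels, tau=0.07):
    """Hypothetical cross-modal supervised contrastive loss: each speech
    embedding is pulled toward text embeddings with the same emotion
    label and pushed from the rest. A sketch, not the paper's exact form."""
    s = F.normalize(speech_z, dim=-1)
    t = F.normalize(text_z, dim=-1)
    sim = s @ t.T / tau                                   # (batch, batch)
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # Average log-probability over positive (same-label) pairs per anchor.
    return -(pos * log_prob).sum(1).div(pos.sum(1).clamp(min=1)).mean()

def expert_consistency_kl(expert_logits):
    """KL between the experts' average distribution and each expert's
    own distribution; penalizing this pulls the experts toward agreement."""
    log_p = F.log_softmax(expert_logits, dim=-1)    # (batch, 3, classes)
    mean_p = log_p.exp().mean(dim=1, keepdim=True)  # average over experts
    return F.kl_div(log_p, mean_p.expand_as(log_p).log(),
                    log_target=True, reduction='batchmean')
```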

Why It Matters

Enables more accurate, privacy-preserving analysis of emotional tone in customer support, mental health sessions, and human-AI interaction.