Joint Learning using Mixture-of-Expert-Based Representation for Speech Enhancement and Robust Emotion Recognition
New Mixture-of-Experts framework slashes noise interference while preserving emotional cues...
Researchers Jing-Tong Tzeng, Carlos Busso, and Chi-Chun Lee have introduced Sparse MERIT (Sparse Mixture-of-Experts Representation Integration Technique), a novel multi-task learning framework that handles speech enhancement and robust emotion recognition simultaneously. Traditional approaches either cascade a speech enhancement front-end with an emotion classifier, which introduces processing artifacts that mask emotional cues, or share a single backbone across both tasks, which suffers from gradient interference between their conflicting objectives. Sparse MERIT avoids both problems by applying frame-wise expert routing over self-supervised speech representations: task-specific gating networks dynamically select, for each frame, a subset of experts from a shared pool.
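To make the routing idea concrete, here is a minimal NumPy sketch of frame-wise, task-specific top-k routing over a shared expert pool. This is an illustration of the general mechanism, not the paper's implementation: the feature dimensions, the number of experts, the top-k value, and the use of simple linear experts are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)

T, D = 4, 8            # frames, feature dim (tiny, for illustration)
n_experts, top_k = 4, 2

# Shared pool of experts: each a linear map D -> D (assumed form).
experts = rng.normal(scale=0.1, size=(n_experts, D, D))

# One gating network per task; both route over the SAME expert pool.
gates = {task: rng.normal(scale=0.1, size=(D, n_experts))
         for task in ("enhancement", "emotion")}

def moe_forward(frames, task):
    """Route each frame through the top-k experts picked by the task's gate."""
    logits = frames @ gates[task]                 # (T, n_experts)
    chosen = np.argsort(logits, axis=-1)[:, -top_k:]  # top-k expert indices
    out = np.zeros_like(frames)
    for t in range(frames.shape[0]):
        idx = chosen[t]
        w = softmax(logits[t, idx])               # renormalize over selected
        for e, weight in zip(idx, w):
            out[t] += weight * (frames[t] @ experts[e])
    return out

x = rng.normal(size=(T, D))      # stand-in for per-frame SSL features
enh = moe_forward(x, "enhancement")
ser = moe_forward(x, "emotion")
print(enh.shape, ser.shape)      # (4, 8) (4, 8)
```

Because each task's gate can pick a different expert subset per frame, the two objectives exercise partially disjoint parameters, which is how this family of designs sidesteps the gradient interference of a fully shared backbone.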
Experiments on the MSP-Podcast corpus show substantial gains under noisy conditions. At -5 dB SNR (the most challenging condition tested), Sparse MERIT improves speech emotion recognition (SER) F1-macro by 12.0% over a speech enhancement pre-processing baseline and by 3.4% over a naive multi-task learning baseline, with statistical significance on unseen noise conditions. For speech enhancement, it achieves 28.2% better segmental SNR than the pre-processing approach and a 20.0% improvement over the naive MTL baseline. The framework is parameter-efficient and task-adaptive, making it suitable for real-world deployment where both audio quality and emotion detection matter, such as call centers, voice assistants, and mental health monitoring systems. The paper has been accepted by IEEE Transactions on Audio, Speech and Language Processing.
- Sparse MERIT uses frame-wise Mixture-of-Experts routing over self-supervised representations to avoid gradient interference between speech enhancement and emotion recognition tasks
- At -5 dB SNR, it improves SER F1-macro by 12.0% over a speech enhancement pre-processing baseline and 3.4% over a naive multi-task baseline
- Speech enhancement quality improves by 28.2% in segmental SNR over the pre-processing baseline
Why It Matters
Enables reliable emotion detection in noisy environments—critical for call centers, voice assistants, and mental health monitoring.