AdaLTM: Adaptive Layer-wise Task Vector Merging for Categorical Speech Emotion Recognition with ASR Knowledge Integration
New framework combines speech-to-text and emotion detection without the optimization conflicts that plague multi-task models.
A research team from National Tsing Hua University and the University of Southern California has introduced AdaLTM, a novel method for integrating Automatic Speech Recognition (ASR) with Speech Emotion Recognition (SER). The core innovation is the use of "task vectors"—representations of learned knowledge from models fine-tuned on specific tasks—which are merged into a frozen base model (WavLM-Large) rather than training a single model on both tasks simultaneously. This sidesteps the classic problem of optimization conflict, where improving performance on one task can degrade performance on the other.
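The idea of a task vector can be made concrete with a toy sketch: it is simply the element-wise difference between a fine-tuned model's weights and the frozen base model's weights, computed per layer. The shapes and values below are invented for illustration, assuming weights are stored as one array per layer.

```python
import numpy as np

# Hypothetical per-layer weights; a real base model such as WavLM-Large
# holds many tensors per transformer layer -- this is a minimal sketch.
rng = np.random.default_rng(0)
base_weights = [rng.standard_normal((4, 4)) for _ in range(3)]

# Stand-in for an ASR model fine-tuned from the same base.
asr_weights = [w + 0.1 * rng.standard_normal(w.shape) for w in base_weights]

# The task vector: fine-tuned weights minus base weights, layer by layer.
asr_task_vector = [ft - base for ft, base in zip(asr_weights, base_weights)]
```

By construction, adding a task vector back onto the base weights recovers the fine-tuned model, which is what makes these deltas composable across tasks.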
The AdaLTM framework extracts separate task vectors from pre-trained ASR and SER models. It then integrates them into the base model using a set of learnable, layer-wise coefficients. This allows the system to adaptively balance how much linguistic (ASR) versus paralinguistic (emotion) information is emphasized at each depth of the transformer stack. Experiments on the benchmark MSP-Podcast dataset show this method successfully mitigates the conflict between the two tasks, leading to more robust emotion recognition that leverages contextual speech understanding without the performance bottlenecks of simple feature fusion.
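The layer-wise merge described above can be sketched as follows. The function name, the fixed coefficient values, and the toy shapes are all ours for illustration; in AdaLTM the per-layer, per-task coefficients are learned rather than hand-set.

```python
import numpy as np

def merge_layerwise(base, tv_asr, tv_ser, lam_asr, lam_ser):
    """Add scaled ASR and SER task vectors onto frozen base weights,
    using one coefficient per task per layer. AdaLTM learns these
    coefficients; here they are fixed for illustration."""
    return [b + la * ta + ls * ts
            for b, ta, ts, la, ls in zip(base, tv_asr, tv_ser, lam_asr, lam_ser)]

# Toy 2-layer model: the shallow layer is weighted toward the ASR
# (linguistic) vector and the deep layer toward the SER (paralinguistic)
# vector -- the coefficient values are invented, not from the paper.
base   = [np.zeros((3, 3)) for _ in range(2)]
tv_asr = [np.ones((3, 3)) for _ in range(2)]
tv_ser = [-np.ones((3, 3)) for _ in range(2)]
merged = merge_layerwise(base, tv_asr, tv_ser,
                         lam_asr=[0.8, 0.2], lam_ser=[0.2, 0.8])
```

Because each layer gets its own pair of coefficients, gradient descent can route linguistic emphasis to some depths and emotional emphasis to others, rather than forcing one global trade-off.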
- Adapts "task vector merging," a technique from NLP and computer vision, to speech tasks for the first time, combining ASR and emotion models.
- Employs layer-specific, learnable coefficients to adaptively balance linguistic and emotional cues across the depth of a WavLM-Large transformer.
- Shows on the MSP-Podcast benchmark that the approach mitigates the optimization conflict that hampers traditional multi-task learning.
Why It Matters
Enables more accurate, context-aware AI for customer service, mental health tools, and content analysis by reliably understanding both what is said and how it's said.