WQ-Fusion AI model blends Whisper and Qwen for 0.836 audio benchmark score
New dual-encoder framework achieves state-of-the-art 0.836 on Interspeech challenge
Learning universal audio representations across diverse acoustic domains remains a challenge for pre-trained models. To address this, Mingda Lin and colleagues propose WQ-Fusion, a robust dual-encoder framework that combines Whisper and Qwen architectures. Instead of relying on static concatenation of features, WQ-Fusion introduces an Adaptive Feature Modulation module and a novel element-wise gated attention mechanism. This design enables the model to dynamically select and emphasize relevant acoustic and semantic dimensions, effectively routing heterogeneous information for better representation learning.
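To make the idea of element-wise gating concrete, here is a minimal sketch of how two feature vectors of equal length can be blended one dimension at a time. It is illustrative only: this summary does not give the paper's actual formulation, so the function name `gated_fusion`, the per-dimension weights, and the sigmoid gate are all assumptions, and the attention and Adaptive Feature Modulation components are not modeled.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(x: float) -> float:
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(
    acoustic: Sequence[float],
    semantic: Sequence[float],
    w_acoustic: Sequence[float],
    w_semantic: Sequence[float],
    bias: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Blend two equal-length feature vectors with a per-dimension gate.

    Hypothetical formulation (not taken from the paper):
        gate_i  = sigmoid(w_acoustic_i * a_i + w_semantic_i * s_i + bias_i)
        fused_i = gate_i * a_i + (1 - gate_i) * s_i
    """
    sizes = {len(acoustic), len(semantic), len(w_acoustic), len(w_semantic), len(bias)}
    if len(sizes) != 1:
        raise ValueError("all inputs must have the same length")

    fused: List[float] = []
    gates: List[float] = []
    for a, s, wa, ws, b in zip(acoustic, semantic, w_acoustic, w_semantic, bias):
        gate = sigmoid(wa * a + ws * s + b)  # one gate value per feature dimension
        gates.append(gate)
        fused.append(gate * a + (1.0 - gate) * s)
    return fused, gates


if __name__ == "__main__":
    # Zero weights and bias give a gate of 0.5, i.e. a plain average.
    fused, gates = gated_fusion([1.0, 2.0], [3.0, 6.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0])
    print(fused, gates)
```

In contrast to static concatenation, where every dimension of both encoders is passed through unchanged, a gate of this kind lets each dimension lean toward one source or the other depending on the input. In a trained model the weights would be learned rather than set by hand.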
Experiments on the Interspeech 2026 Audio Encoder Capability Challenge (Track A) benchmark demonstrate the effectiveness of WQ-Fusion. It achieves an overall score of 0.836, outperforming the strongest single-encoder baseline by a clear margin. The paper, accepted at INTERSPEECH 2026, highlights how dynamic gated attention can improve cross-domain audio tasks. This work opens new possibilities for building universal audio encoders that generalize across speech, music, and environmental sounds.
- Combines Whisper and Qwen architectures in a dual-encoder framework for audio representation learning
- Uses element-wise gated attention and Adaptive Feature Modulation for dynamic feature selection
- Achieves 0.836 overall score on Interspeech 2026 Audio Encoder Capability Challenge (Track A), beating single-encoder baselines
Why It Matters
Enables more robust universal audio representations across diverse acoustic domains.