SPEAR: A Unified SSL Framework for Learning Speech and Audio Representations
New framework distills knowledge from two specialized teachers into one versatile model, setting a new state of the art on the SUPERB benchmark.
A research team led by Xiaoyu Yang of Cambridge University has introduced SPEAR (SPEech and Audio Representations), a self-supervised learning framework that bridges the persistent gap between speech understanding and general audio event recognition. Most existing acoustic models specialize in either speech processing (like WavLM or HuBERT) or environmental sound analysis (like PaSST), leaving capabilities fragmented across model families. SPEAR addresses this by distilling complementary knowledge from two specialized teacher models, one optimized for speech and another for general audio, into a single unified student model through a novel knowledge distillation approach. This unified architecture represents a significant step toward general-purpose acoustic intelligence that can handle everything from voice commands to complex soundscapes.
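To make the dual-teacher setup concrete, here is a minimal sketch assuming PyTorch-style modules. The class name, the projection heads, and the shared 768-dimensional trunk are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of a dual-teacher distillation setup (hypothetical names
# and shapes; SPEAR's actual architecture is defined in the paper).
import torch
import torch.nn as nn

class DualTeacherDistiller(nn.Module):
    def __init__(self, student, speech_teacher, audio_teacher, dim=768):
        super().__init__()
        self.student = student                        # trainable
        self.speech_teacher = speech_teacher.eval()   # frozen speech SSL model
        self.audio_teacher = audio_teacher.eval()     # frozen general-audio model
        for teacher in (self.speech_teacher, self.audio_teacher):
            for p in teacher.parameters():
                p.requires_grad = False
        # Separate heads let one student trunk predict two heterogeneous
        # teacher targets from the same masked input.
        self.speech_head = nn.Linear(dim, dim)
        self.audio_head = nn.Linear(dim, dim)

    @torch.no_grad()
    def teacher_targets(self, wav):
        # Continuous teacher features; SPEAR discretizes these into tokens
        # via multi-codebook vector quantization (see the next sketch).
        return self.speech_teacher(wav), self.audio_teacher(wav)

    def forward(self, masked_wav):
        h = self.student(masked_wav)                  # (batch, frames, dim)
        return self.speech_head(h), self.audio_head(h)
```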
The technical innovation centers on SPEAR's use of multi-codebook vector quantization, which converts continuous teacher representations into fine-grained discrete tokens capturing both semantic meaning and acoustic characteristics. The framework employs an asymmetric pre-training loss to jointly predict these heterogeneous token targets from masked audio inputs, and a novel token mixing mechanism further improves robustness in complex auditory scenes. The benchmark results are strong: SPEAR establishes new state-of-the-art performance on the comprehensive SUPERB speech benchmark, surpassing WavLM Large on 12 of 15 tasks, including speaker identification, emotion recognition, and intent classification, while remaining competitive on the HEAR benchmark for general audio tasks. The researchers will release both code and pre-trained models, positioning SPEAR as a versatile foundation for next-generation speech and audio applications.
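A compact sketch of what the tokenization and masked-prediction objective might look like, again in PyTorch. The nearest-neighbour codebook lookup, the per-codebook classifiers, and the weight `lam` standing in for the asymmetric loss are all assumptions for illustration; the paper's exact formulation may differ.

```python
# Sketch: discretize teacher features with multiple codebooks, then train
# the student to predict the resulting token IDs at masked frames.
# Codebook count/size and the asymmetric weight `lam` are illustrative.
import torch
import torch.nn.functional as F

def quantize_multi_codebook(feats, codebooks):
    """feats: (B, T, D) teacher features; codebooks: list of (V, D) tensors.
    Returns token IDs of shape (B, T, K), one stream per codebook."""
    ids = []
    for cb in codebooks:
        # Nearest-neighbour assignment of each frame to a codebook entry.
        dists = torch.cdist(feats, cb.unsqueeze(0).expand(feats.size(0), -1, -1))
        ids.append(dists.argmin(dim=-1))              # (B, T)
    return torch.stack(ids, dim=-1)                   # (B, T, K)

def masked_prediction_loss(logits, target_ids, mask):
    """logits: (B, T, K, V), one classifier per codebook;
    target_ids: (B, T, K); mask: (B, T) bool, True at masked frames."""
    l = logits[mask]                                  # (N, K, V) at masked frames
    t = target_ids[mask]                              # (N, K)
    return F.cross_entropy(l.transpose(1, 2), t)      # CE over the vocab dim

# Joint objective over both teachers' token streams; `lam` is a stand-in
# for however SPEAR weights the two targets asymmetrically.
def spear_style_loss(speech_logits, speech_ids, audio_logits, audio_ids,
                     mask, lam=0.5):
    return (masked_prediction_loss(speech_logits, speech_ids, mask)
            + lam * masked_prediction_loss(audio_logits, audio_ids, mask))
```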
- Unifies speech and audio understanding in one model by distilling knowledge from two specialized SSL teachers
- Uses multi-codebook vector quantization to create discrete tokens capturing both semantic and acoustic information
- Outperforms WavLM Large on 12 of 15 SUPERB tasks while staying competitive on HEAR audio benchmarks
Why It Matters
Enables a single AI model to handle both voice interfaces and environmental sound analysis, reducing development complexity and improving real-world performance.