New UME Architecture Unifies Diarization, Separation, and ASR
Shared encoder slashes error rates on overlapping speech tasks by 50%+
Deep Dive
A new architecture called the Unified Multi-Speaker Encoder (UME) jointly learns representations for speaker diarization, speech separation, and multi-speaker ASR on top of a shared speech foundation encoder. It combines hidden representations from multiple encoder layers through a residual weighted-sum encoding (RWSE), which lets the model capture interdependencies between the three tasks. On Libri2Mix and Libri3Mix, it achieves diarization error rates of 1.37% and 2.29%, respectively, outperforming dedicated baselines. The work was accepted to IEEE ASRU 2025.
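For readers unfamiliar with the metric, diarization error rate (DER) is conventionally the sum of missed speech, false-alarm speech, and speaker-confusion time, divided by total reference speech time. The sketch below illustrates that standard definition only; the function name and the input durations are hypothetical and are not taken from the UME evaluation.

```python
def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """Standard DER: all error time divided by total reference speech time.

    All arguments are durations in seconds. The values used below are
    made-up illustration numbers, not results from the UME paper.
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + confusion) / total_speech


# Hypothetical example: 30 s of errors over 2000 s of reference speech.
der = diarization_error_rate(missed=12, false_alarm=8, confusion=10,
                             total_speech=2000)
print(f"DER = {der:.2%}")
```

Under this definition, a DER of 1.37% means that roughly 1.4 seconds out of every 100 seconds of reference speech were missed, falsely detected, or attributed to the wrong speaker.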
Key Points
- UME jointly trains speaker diarization, speech separation, and multi-speaker ASR using a shared encoder
- Achieves diarization error rates of 1.37% on Libri2Mix and 2.29% on Libri3Mix, outperforming dedicated baselines
- Uses residual weighted-sum encoding (RWSE) from multiple encoder layers for cross-task alignment
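The idea behind a weighted-sum encoding over encoder layers can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the summary does not spell out the exact RWSE formulation, so the softmax-normalized layer weights, the way the residual is applied (adding the top layer's output back), and every name below are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum_encoding(layer_outputs, layer_logits, residual=True):
    """Fuse per-layer hidden states into one representation.

    layer_outputs[l][t][d] holds layer l, frame t, feature d.
    layer_logits holds one (learnable, here fixed) scalar per layer.
    ASSUMPTION: the residual path adds the top layer's output to the
    weighted sum; the actual RWSE design may differ.
    """
    weights = softmax(layer_logits)
    num_frames = len(layer_outputs[0])
    dim = len(layer_outputs[0][0])
    fused = [[0.0] * dim for _ in range(num_frames)]
    for weight, layer in zip(weights, layer_outputs):
        for t in range(num_frames):
            for d in range(dim):
                fused[t][d] += weight * layer[t][d]
    if residual:
        top_layer = layer_outputs[-1]
        for t in range(num_frames):
            for d in range(dim):
                fused[t][d] += top_layer[t][d]
    return fused


# Toy input: 2 layers, 1 frame, 2 features, equal layer weights.
hidden_states = [[[1.0, 2.0]], [[3.0, 4.0]]]
fused = weighted_sum_encoding(hidden_states, layer_logits=[0.0, 0.0])
print(fused)
```

In a real system the per-layer scalars would be trained along with the task heads, so each of the three tasks can draw on whichever encoder depths carry the most useful information for it.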
Why It Matters
One model now outperforms three separate systems on overlapping-speech benchmarks, which could simplify pipelines for meeting transcription and other multi-speaker audio.