New UME Architecture Unifies Diarization, Separation, and ASR

arXiv eess.AS May 14, 2026

⚡Shared encoder slashes error rates on overlapping speech tasks by 50%+

Deep Dive

A new architecture, the Unified Multi-Speaker Encoder (UME), jointly learns representations for speaker diarization, speech separation, and multi-speaker ASR on top of a shared speech foundation encoder. It applies residual weighted-sum encoding (RWSE) to the outputs of multiple encoder layers, which lets the model exploit interdependencies between the three tasks. On Libri2Mix and Libri3Mix, UME reaches diarization error rates of 1.37% and 2.29%, respectively, outperforming dedicated baselines. The work was accepted to IEEE ASRU 2025.
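The summary does not spell out how RWSE is computed, so the following is only a rough sketch of one plausible reading: a softmax-weighted sum over the outputs of several encoder layers, added back to the last layer's output as a residual. The function names, the plain-list tensor layout, and the exact form of the residual are assumptions made for illustration, not the authors' implementation.

```python
import math

def softmax(scores):
    """Turn unnormalized per-layer scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def residual_weighted_sum(layer_outputs, layer_scores):
    """Fuse several encoder layers into one representation.

    layer_outputs: list of L layers, each a list of T frames,
                   each frame a list of D floats (hypothetical layout).
    layer_scores:  L learnable scalars, one per layer (assumed).
    """
    weights = softmax(layer_scores)
    last = layer_outputs[-1]  # assumed residual path: final layer output
    fused = []
    for t in range(len(last)):
        frame = []
        for d in range(len(last[t])):
            # Weighted sum of the same feature across all layers.
            mixed = sum(w * layer[t][d] for w, layer in zip(weights, layer_outputs))
            # Residual connection back to the final layer.
            frame.append(last[t][d] + mixed)
        fused.append(frame)
    return fused

# Two layers, one frame, two features; equal scores give equal weights.
layers = [[[1.0, 2.0]], [[3.0, 4.0]]]
fused = residual_weighted_sum(layers, [0.0, 0.0])
```

In a real system the per-layer scores would be trained jointly with the task heads, so each task can draw on whichever encoder depths carry the most useful information.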

Key Points

UME jointly trains speaker diarization, speech separation, and multi-speaker ASR using a shared encoder
Achieves diarization error rates of 1.37% on Libri2Mix and 2.29% on Libri3Mix, outperforming dedicated baselines
Uses residual weighted-sum encoding (RWSE) from multiple encoder layers for cross-task alignment
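
For context on the error rates quoted above: diarization error rate (DER) is conventionally the sum of missed speech, false-alarm speech, and speaker confusion, divided by the total reference speech time. The sketch below computes a simplified frame-level version that handles overlapping speakers. It assumes hypothesis speaker labels have already been mapped to reference labels and applies no forgiveness collar, so it illustrates the metric rather than the paper's evaluation setup; the function name and data layout are hypothetical.

```python
def diarization_error_rate(reference, hypothesis):
    """Frame-level DER.

    reference, hypothesis: equal-length lists with one set of active
    speaker labels per frame (an empty set means silence).
    """
    missed = false_alarm = confusion = total = 0
    for ref, hyp in zip(reference, hypothesis):
        n_ref, n_hyp = len(ref), len(hyp)
        n_correct = len(ref & hyp)
        missed += max(0, n_ref - n_hyp)             # reference speakers with no output
        false_alarm += max(0, n_hyp - n_ref)        # output speakers with no reference
        confusion += min(n_ref, n_hyp) - n_correct  # speech attributed to the wrong speaker
        total += n_ref
    if total == 0:
        raise ValueError("reference contains no speech")
    return (missed + false_alarm + confusion) / total

# Four frames: one correct, one missed overlap, one confusion, one false alarm.
reference = [{"A"}, {"A", "B"}, {"B"}, set()]
hypothesis = [{"A"}, {"A"}, {"A"}, {"B"}]
der = diarization_error_rate(reference, hypothesis)
```

Because overlapping frames count once per active reference speaker, mixtures such as Libri2Mix and Libri3Mix penalize a system for every speaker it fails to detect in an overlap.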

Why It Matters

A single shared-encoder model can outperform dedicated baselines on overlapping speech, which could simplify pipelines for meeting transcription and other multi-speaker audio.
