Audio & Speech

UNet-Based Fusion and Exponential Moving Average Adaptation for Noise-Robust Speaker Recognition

A scalable framework that fuses noisy and enhanced speech for robust speaker identification.

Deep Dive

Researchers from multiple institutions have introduced UF-EMA (UNet-based Fusion with Exponential Moving Average Adaptation), a new framework for noise-robust speaker recognition. Traditional approaches jointly train speech enhancement and speaker embedding networks, but they often fail to leverage the benefits of large-scale pre-training on clean speech and do not explicitly preserve speaker information during denoising. UF-EMA addresses these limitations by treating noisy and enhanced speech as a multi-channel input to a UNet-based fusion module, enabling the speaker encoder to exploit both signals effectively.
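The channel-stacking idea can be sketched in a few lines. This is a minimal illustration with assumed feature shapes, not the authors' code: the noisy spectrogram and the enhancer's output are stacked along a channel axis, producing the two-channel input a UNet-style fusion module would consume.

```python
import numpy as np

# Hypothetical shapes: (freq_bins, time_frames) log-mel spectrograms.
# Random data stands in for real features in this sketch.
noisy_spec = np.random.randn(80, 200)      # noisy speech features
enhanced_spec = np.random.randn(80, 200)   # denoised speech features

# Stack the two signals along a new channel axis, so the fusion
# module sees both the raw and the enhanced views of the utterance.
fused_input = np.stack([noisy_spec, enhanced_spec], axis=0)
print(fused_input.shape)  # (2, 80, 200)
```

Treating the pair as channels (rather than, say, concatenating in time) lets the downstream network compare the two views at each time-frequency point.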

Additionally, the framework applies an Exponential Moving Average (EMA) strategy to a speaker encoder pre-trained on clean speech. This smooths the adaptation to noisy conditions and mitigates overfitting. Experimental results on multiple noise-contaminated test sets show that UF-EMA outperforms existing joint training methods, demonstrating superior robustness and generalization. The paper has been submitted to Interspeech 2026 and is available on arXiv.
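The EMA idea itself is simple to state: the adapted weights are blended into a slowly moving copy of the clean-speech encoder. A generic sketch follows, with an illustrative decay value and toy parameter layout (not the paper's settings):

```python
import numpy as np

def ema_update(ema_params, new_params, decay=0.99):
    """One EMA step: ema <- decay * ema + (1 - decay) * new.

    Generic exponential-moving-average weight update; `decay` and the
    dict-of-arrays layout here are illustrative assumptions.
    """
    return {k: decay * ema_params[k] + (1.0 - decay) * new_params[k]
            for k in ema_params}

# Toy example: the EMA copy drifts slowly toward the adapted weights,
# smoothing the clean-to-noisy transition instead of jumping to it.
ema = {"w": np.zeros(3)}       # stand-in for the clean pre-trained encoder
adapted = {"w": np.ones(3)}    # stand-in for weights fine-tuned on noise
for _ in range(100):
    ema = ema_update(ema, adapted, decay=0.99)
print(ema["w"][0])  # approaches 1 - 0.99**100, roughly 0.634
```

Because each step moves the averaged weights only a small fraction toward the latest update, noisy-batch fluctuations are damped, which is the mechanism behind the reduced overfitting the paper reports.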

Key Points
  • UF-EMA uses a UNet-based fusion module to combine noisy and enhanced speech as multi-channel input for speaker recognition.
  • Exponential Moving Average (EMA) adaptation smooths the transition from clean to noisy conditions, reducing overfitting.
  • Outperforms existing joint training methods on multiple noise-contaminated test sets.

Why It Matters

Enables reliable speaker identification in real-world noisy environments, improving security and voice assistant accuracy.