EvoTSE: Evolving Enrollment for Target Speaker Extraction
New framework continuously updates speaker profiles, boosting accuracy in noisy, real-world audio.
A research team from Northwestern Polytechnical University and Xiamen University has introduced EvoTSE, a significant evolution in Target Speaker Extraction (TSE) technology. Traditional TSE isolates a specific speaker's voice from a mixture using a fixed, pre-recorded sample (enrollment). However, this static approach is brittle; poor-quality enrollment or speaker confusion—where the model extracts the wrong voice—severely limits performance. EvoTSE tackles this by making the enrollment dynamic. Instead of relying on a single reference, the framework continuously refines its understanding of the target speaker by retrieving and filtering high-confidence estimates from its own past, successful extractions within the same audio stream.
This self-improving mechanism lets the system adapt to the target's voice in real time, reducing dependence on a pristine initial recording and making extraction more robust to errors. The researchers validated EvoTSE across multiple benchmarks, where it achieved consistent performance gains. Crucially, it showed particular strength in out-of-domain (OOD) scenarios (noisy, real-world conditions that differ from its training data), where conventional TSE often struggles. By closing the loop between extraction and enrollment, EvoTSE represents a move toward more adaptive and reliable speech separation systems. The team has open-sourced the code and model checkpoints to foster further development.
- Dynamically updates the speaker enrollment using reliability-filtered retrieval from historical estimates, moving beyond a static reference.
- Reduces 'speaker confusion' errors and relaxes quality requirements for the initial pre-recorded enrollment sample.
- Shows consistent improvements in benchmarks, with notable gains in challenging out-of-domain (OOD) audio scenarios.
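To make the evolving-enrollment idea concrete, here is a minimal sketch of a reliability-filtered profile update. This is an illustrative interpretation, not the authors' implementation: the function name `evolve_enrollment`, the cosine-similarity threshold used as a reliability proxy, and the momentum-style running average are all assumptions for demonstration.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def evolve_enrollment(initial_emb, chunk_embs, sim_threshold=0.8, momentum=0.9):
    """Hypothetical sketch of dynamic enrollment updating.

    initial_emb : speaker embedding from the pre-recorded enrollment.
    chunk_embs  : embeddings of the model's own past extracted segments
                  from the same audio stream.

    Only segments whose similarity to the current profile exceeds the
    threshold (a stand-in for the paper's reliability filtering) are
    retained; accepted segments refine the profile via a running average.
    """
    profile = initial_emb / np.linalg.norm(initial_emb)
    for emb in chunk_embs:
        emb = emb / np.linalg.norm(emb)
        if cosine(profile, emb) >= sim_threshold:  # reliability filter
            profile = momentum * profile + (1 - momentum) * emb
            profile /= np.linalg.norm(profile)     # keep unit norm
    return profile
```

Filtering before updating is what guards against speaker confusion: a segment where the wrong voice was extracted scores low similarity and is discarded, so it cannot pollute the evolving profile.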
Why It Matters
Enables clearer voice isolation in noisy real-world calls, meetings, and audio forensics, making AI hearing more reliable.