Audio & Speech

CTRnet and PuLSS achieve SOTA speech separation on CHiME-6

First neural method to beat guided source separation on real conversational data.

Deep Dive

Researchers Zhong-Qiu Wang and Samuele Cornell have introduced CTRnet, a cross-talk reduction (CTR) method that isolates a wearer's speech from noisy close-talk microphone recordings. Although the wearer's voice is the loudest source in a close-talk signal, the recording often also contains strong speech from other speakers. The authors also propose PuLSS (Pseudo-Label based Far-Field Speech Separation), which uses CTRnet’s estimates of each wearer's speech as pseudo-labels to train far-field separation models. A key innovation is that both components can be trained directly on real recordings from the target domain, which avoids the generalization gap that models trained on simulated data typically face on real recordings.
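
The pseudo-labeling idea can be illustrated with a toy sketch. This is not the authors' implementation: the mixing weights, the noisy stand-in for CTRnet (toy_ctr), and the 2x2 linear separator are all assumptions made for illustration, and random samples stand in for speech. The sketch only shows that a far-field separator can be fit against estimated targets without ever seeing the true sources during training.

```python
import random

rng = random.Random(0)

def make_sources(n):
    # Two independent toy "speakers"; random samples stand in for speech.
    return [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n)]

def far_field_mix(s):
    # Hypothetical far-field mixing: both distant mics hear both speakers.
    return (0.6 * s[0] + 0.4 * s[1], 0.3 * s[0] + 0.7 * s[1])

def toy_ctr(s, err=0.05):
    # Stand-in for CTRnet (assumption): each wearer's speech plus a small
    # estimation error. The real model works from close-talk recordings.
    return (s[0] + rng.gauss(0, err), s[1] + rng.gauss(0, err))

def train_separator(mixtures, pseudo_labels, epochs=30, lr=0.1):
    # Fit a 2x2 demixing matrix by SGD on squared error to the pseudo-labels.
    w = [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(epochs):
        for y, p in zip(mixtures, pseudo_labels):
            for k in range(2):
                grad = w[k][0] * y[0] + w[k][1] * y[1] - p[k]
                w[k][0] -= lr * grad * y[0]
                w[k][1] -= lr * grad * y[1]
    return w

def separate(w, y):
    return tuple(w[k][0] * y[0] + w[k][1] * y[1] for k in range(2))

def mse(estimates, sources):
    total = sum((e[k] - s[k]) ** 2
                for e, s in zip(estimates, sources) for k in range(2))
    return total / (2 * len(sources))

# Training uses only far-field mixtures and pseudo-labels, never true sources.
train_sources = make_sources(500)
train_mix = [far_field_mix(s) for s in train_sources]
pseudo = [toy_ctr(s) for s in train_sources]
w = train_separator(train_mix, pseudo)

# True sources are used only to score the toy on held-out data.
test_sources = make_sources(200)
test_mix = [far_field_mix(s) for s in test_sources]
before = mse(test_mix, test_sources)
after = mse([separate(w, y) for y in test_mix], test_sources)
print(f"MSE before separation: {before:.4f}, after: {after:.4f}")
```

In this toy setting, the separator trained on pseudo-labels should score a much lower error against the true sources than the unprocessed far-field mixture does. A real system would replace the linear map with a neural separation model and the noisy stand-in with actual CTRnet outputs.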

On the challenging CHiME-6 dataset, the framework achieves state-of-the-art automatic speech recognition (ASR) performance under both oracle and estimated speaker diarization, outperforming all submissions to the CHiME-7 and CHiME-8 challenges. To the authors’ knowledge, this is the first neural speech separation method to substantially outperform guided source separation on real conversational "speech-in-the-wild" data. The result suggests that neural separation models trained on real recordings can be practical for real-world speech processing.

Key Points
  • CTRnet reduces cross-talk in close-talk microphone signals and is trained directly on real recordings.
  • PuLSS uses CTRnet's outputs as pseudo-labels to train far-field separation models, avoiding synthetic data.
  • The framework achieves state-of-the-art ASR performance on CHiME-6, outperforming all CHiME-7 and CHiME-8 challenge submissions.

Why It Matters

The approach enables far-field speech separation models to be trained without simulated data, which could improve real-world applications such as voice assistants and meeting transcription.