Reverberation-Robust Localization of Speakers Using Distinct Speech Onsets and Multi-channel Cross-Correlations

Two new algorithms use speech onsets and multi-channel cross-correlation to localize speakers in reverberant rooms where existing methods struggle.

Deep Dive

A new research paper by Shoufeng Lin tackles one of audio processing's toughest challenges: accurately locating speakers in rooms with heavy reverberation, where reflections off walls and other surfaces smear the direct-path sound. The paper introduces two algorithms designed to work where many existing methods struggle. The first decomposes the microphone signals into subbands with an auditory filterbank, which helps separate concurrent speakers. Its core contribution is a novel speech onset detection approach, derived from speech and impulse response models; the detected onsets are then used to compute a multi-channel cross-correlation coefficient (MCCC) in each subband. The subband results are combined to estimate each speaker's direction of arrival (DOA).
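To make the MCCC idea concrete, here is a minimal sketch in Python with NumPy. It is not the paper's algorithm: it works on raw full-band signals rather than onsets in auditory subbands, and it assumes integer sample delays, circular shifts, and a uniform linear array in which channel m lags channel 0 by m*d samples. The function names `mccc_squared` and `best_inter_mic_delay` are hypothetical. It uses the common textbook form of the squared MCCC, one minus the determinant of the normalized spatial correlation matrix of the time-aligned channels.

```python
import numpy as np

def mccc_squared(signals, delays):
    """Squared MCCC of an (M, N) array after undoing candidate integer delays.

    delays[m] is the hypothesised delay, in samples, of channel m.
    """
    # Time-align the channels. np.roll is a circular shift, which is an
    # acceptable approximation when frames are much longer than the delays.
    aligned = np.stack([np.roll(ch, -int(d)) for ch, d in zip(signals, delays)])
    # Normalised spatial correlation matrix (ones on the diagonal).
    R = np.corrcoef(aligned)
    # det(R) tends to 0 for coherent channels and to 1 for uncorrelated ones.
    return 1.0 - np.linalg.det(R)

def best_inter_mic_delay(signals, max_delay):
    """Scan candidate delays d, assuming channel m lags channel 0 by m*d samples."""
    num_mics = signals.shape[0]
    candidates = range(-max_delay, max_delay + 1)
    scores = [mccc_squared(signals, d * np.arange(num_mics)) for d in candidates]
    best = int(np.argmax(scores))
    return candidates[best], scores[best]

# Toy data (assumption): one white-noise source reaching three microphones
# with 3 samples of delay between neighbours, plus a little sensor noise.
rng = np.random.default_rng(0)
source = rng.standard_normal(4096)
mics = np.stack([np.roll(source, 3 * m) for m in range(3)])
mics = mics + 0.05 * rng.standard_normal(mics.shape)

delay, score = best_inter_mic_delay(mics, max_delay=10)
print(delay, round(score, 3))
```

The candidate delay that maximizes the score maps to a direction of arrival once the microphone spacing, sampling rate, and speed of sound are known. In the paper's setting, such a score would be computed per subband on the encoded onsets and then combined across subbands.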

The second algorithm takes a different tack, extending the well-known Generalized Cross-Correlation with Phase Transform (GCC-PHAT) method. It exploits the redundant information available across the multiple microphones of an array to counter the distorting effects of reverberation. Both methods were evaluated under adverse conditions: simulated environments with reverberation times (T60) of up to 1 second, and real recordings in a room with a T60 of approximately 0.65 seconds. According to the reported experiments, the new techniques reliably locate both static and moving speakers and outperform the state-of-the-art localization methods they were compared against in these reverberant scenarios.
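For readers unfamiliar with the baseline, the following is a minimal sketch of standard two-microphone GCC-PHAT in Python with NumPy. It shows only the classic pairwise method; the paper's multi-microphone extension is not reproduced here. The function name `gcc_phat`, the `max_lag` parameter, and the toy signals are assumptions made for illustration.

```python
import numpy as np

def gcc_phat(x, y, max_lag):
    """Estimate the delay of x relative to y, in samples, using GCC-PHAT.

    Returns (estimated_lag, correlation over lags -max_lag..max_lag).
    max_lag must be at least 1.
    """
    n = len(x) + len(y)                      # zero-pad to avoid wrap-around
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    cross = X * np.conj(Y)                   # cross-power spectrum
    cross /= np.abs(cross) + 1e-12           # PHAT: keep phase, drop magnitude
    cc = np.fft.irfft(cross, n=n)
    # Reorder so the output covers lags -max_lag..max_lag in ascending order.
    cc = np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))
    lags = np.arange(-max_lag, max_lag + 1)
    return int(lags[np.argmax(cc)]), cc

# Toy data (assumption): the second microphone hears the same white-noise
# source 5 samples later than the first.
rng = np.random.default_rng(1)
ref = rng.standard_normal(2048)
delayed = np.concatenate((np.zeros(5), ref[:-5]))

est, _ = gcc_phat(delayed, ref, max_lag=20)
print(est)
```

The phase transform whitens the cross-spectrum so that the correlation peak stays sharp, but in strongly reverberant rooms reflections can add competing peaks for any single microphone pair. That weakness is the motivation for pooling information across more microphones.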

Key Points
  • Uses novel speech onset detection and multi-channel cross-correlation to suppress reverberation effects.
  • Evaluated on simulated signals with reverberation times (T60) of up to 1 second and on real recordings in a room with a T60 of approximately 0.65 seconds.
  • Outperforms the state-of-the-art methods it was compared against when locating both static and moving speakers under reverberation.

Why It Matters

More robust speaker localization could make voice assistants, meeting transcription, and security systems more reliable in echo-prone real-world environments such as auditoriums and factories.