Nobutaka Ito's cSTMM unifies speech separation models, beats baseline by up to 0.25 dB
New mixture model connects four prior methods under one parameter and improves SDRi in every tested condition.
In audio and speech processing, blind speech separation (BSS) aims to isolate individual speakers from a mixed recording without prior knowledge of the sources or the environment. Traditional mask-based BSS often relies on extracting phase and level difference features under specific assumptions like plane-wave propagation. Ito's work instead adopts a directional statistical approach that directly clusters normalized multichannel observations on the complex unit sphere.
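The normalization step mentioned above can be sketched as follows. This is a minimal illustration, assuming the observations are available as a complex STFT array of shape (frames, frequencies, channels); the function name `normalize_observations` and the `eps` guard are hypothetical and not taken from Ito's work.

```python
import numpy as np

def normalize_observations(stft, eps=1e-12):
    """Scale each multichannel time-frequency vector to unit Euclidean norm.

    stft: complex array of shape (frames, freqs, channels) (assumed layout).
    Returns an array of the same shape whose channel vectors lie on the
    complex unit sphere, so only directional information remains.
    """
    norms = np.linalg.norm(stft, axis=-1, keepdims=True)
    # eps (an illustrative choice) avoids division by zero for silent bins.
    return stft / np.maximum(norms, eps)

# Toy input: 5 frames, 4 frequency bins, 3 microphones.
rng = np.random.default_rng(0)
x = rng.standard_normal((5, 4, 3)) + 1j * rng.standard_normal((5, 4, 3))
z = normalize_observations(x)
```

After this step every vector in `z` has unit length, which is what lets a directional mixture model cluster the observations by spatial signature rather than by signal level.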
The key innovation is cSTMM, a framework that unifies four previously disparate mixture models, including the complex angular central Gaussian mixture model (cACGMM), the complex Bingham mixture model (cBMM), and the complex Watson mixture model (cWMM), through a single degrees-of-freedom parameter ν. This allows systematic exploration of how the tail behavior of the distribution affects separation quality. Ito derived a generalized minorization-maximization (MM) algorithm for parameter estimation, and the model numerically recovers the prior models at specific values of ν. On noiseless, reverberated LibriSpeech data, setting ν=1 improved average SDRi by up to 0.25 dB over the cACGMM-equivalent setting (ν=M), with gains in all test conditions.
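The article does not give the cSTMM density or its MM updates, so the following is only a sketch of one special case it names: the cACGMM, which the model is said to recover at ν=M. It uses the EM-style updates commonly published for cACGMM on a single frequency bin. The function name `cacgmm_em`, the random initialization, and the `eps` regularization are illustrative assumptions, not Ito's implementation.

```python
import numpy as np

def cacgmm_em(z, num_sources, num_iters=20, seed=0, eps=1e-10):
    """Fit a cACGMM to unit-norm vectors z of shape (T, M) at one frequency.

    Returns posteriors gamma of shape (T, K), which can serve as soft masks.
    """
    T, M = z.shape
    rng = np.random.default_rng(seed)
    gamma = rng.dirichlet(np.ones(num_sources), size=T)  # assumed random init
    B = np.tile(np.eye(M, dtype=complex), (num_sources, 1, 1))

    def quad_form(Bk):
        # z_t^H B_k^{-1} z_t for every frame t (real for Hermitian B_k).
        Binv = np.linalg.inv(Bk)
        q = np.real(np.einsum('tm,mn,tn->t', z.conj(), Binv, z))
        return np.maximum(q, eps)

    for _ in range(num_iters):
        # M-step: mixture weights and fixed-point update of each B_k.
        alpha = gamma.mean(axis=0)
        for k in range(num_sources):
            w = gamma[:, k] / quad_form(B[k])
            outer = np.einsum('t,tm,tn->mn', w, z, z.conj())
            B[k] = M * outer / max(gamma[:, k].sum(), eps) + eps * np.eye(M)
        # E-step: log p ∝ log alpha_k - log det B_k - M log(z^H B_k^{-1} z).
        log_p = np.empty((T, num_sources))
        for k in range(num_sources):
            _, logdet = np.linalg.slogdet(B[k])
            log_p[:, k] = np.log(alpha[k] + eps) - logdet - M * np.log(quad_form(B[k]))
        log_p -= log_p.max(axis=1, keepdims=True)  # numerical stability
        gamma = np.exp(log_p)
        gamma /= gamma.sum(axis=1, keepdims=True)
    return gamma

# Toy usage: 200 random unit-norm vectors with 3 channels, 2 sources.
rng = np.random.default_rng(1)
z = rng.standard_normal((200, 3)) + 1j * rng.standard_normal((200, 3))
z /= np.linalg.norm(z, axis=1, keepdims=True)
gamma = cacgmm_em(z, num_sources=2)
```

In a full separation system this would run per frequency bin and be followed by permutation alignment across frequencies, neither of which is shown here.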
- cSTMM generalizes four prior directional mixture models (including cACGMM, cBMM, and cWMM) using a single parameter ν.
- A single setting, ν=1, outperformed the cACGMM equivalent (ν=M) in all test conditions, with an average SDRi gain of up to 0.25 dB.
- Parameters are estimated with a generalized minorization-maximization (MM) procedure, and the method was evaluated without restarts on reverberated LibriSpeech mixtures.
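For readers unfamiliar with the metric, SDRi is the signal-to-distortion ratio of the separated output minus that of the unprocessed mixture. The sketch below uses a plain energy-ratio SDR as a simplifying assumption; the article does not say which SDR variant the evaluation used, and the function names are hypothetical.

```python
import numpy as np

def sdr_db(reference, estimate, eps=1e-12):
    """Energy-ratio SDR in dB (a simplified definition, assumed here)."""
    error = reference - estimate
    return 10.0 * np.log10((np.sum(reference**2) + eps) / (np.sum(error**2) + eps))

def sdr_improvement_db(reference, mixture, estimate):
    """SDRi: SDR of the estimate minus SDR of the raw mixture."""
    return sdr_db(reference, estimate) - sdr_db(reference, mixture)

# Toy signals: a sine "speaker" plus a noise interferer.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
reference = np.sin(2 * np.pi * 220.0 * t)
interferer = rng.standard_normal(t.shape)
mixture = reference + interferer
estimate = reference + 0.5 * interferer  # interferer amplitude halved

sdri = sdr_improvement_db(reference, mixture, estimate)
print(f"SDRi: {sdri:.2f} dB")  # halving the interferer gives about 6.02 dB
```

On this scale, a 0.25 dB change corresponds to reducing the residual distortion energy by roughly 5 to 6 percent.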
Why It Matters
A unified model that improves speech separation accuracy could benefit hearing aids, conferencing systems, and voice assistants.