Audio & Speech

Nobutaka Ito's cSTMM unifies speech separation models, beats baseline by 0.25 dB

New mixture model connects four prior methods under one parameter and improves SDRi (signal-to-distortion ratio improvement) across all tested acoustic conditions.

Deep Dive

In audio and speech processing, blind speech separation (BSS) aims to isolate individual speakers from a mixed recording without prior knowledge of the sources or the environment. Traditional mask-based BSS often relies on extracting phase and level difference features under specific assumptions like plane-wave propagation. Ito's work instead adopts a directional statistical approach that directly clusters normalized multichannel observations on the complex unit sphere.
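The normalization step can be sketched in a few lines. This is a minimal illustration, assuming the observations are complex STFT vectors with one M-channel vector per time-frequency bin; the function name, array layout, and random test data are hypothetical and not taken from Ito's paper.

```python
import numpy as np

def normalize_to_unit_sphere(x, eps=1e-12):
    """Scale each complex vector (last axis = channels) to unit Euclidean norm.

    Hypothetical helper: x has shape (..., M), one M-channel observation per row.
    """
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    # Guard against division by zero for silent bins.
    return x / np.maximum(norm, eps)

# Toy data standing in for multichannel STFT observations (5 bins, 3 channels).
rng = np.random.default_rng(0)
x = rng.standard_normal((5, 3)) + 1j * rng.standard_normal((5, 3))
z = normalize_to_unit_sphere(x)
```

After this step every row of `z` lies on the complex unit sphere, so only the direction of each observation remains for a directional mixture model to cluster.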

The key innovation is cSTMM, a framework that unifies four previously disparate mixture models, among them the complex angular central Gaussian mixture model (cACGMM), the complex Bingham mixture model (cBMM), and the complex Watson mixture model (cWMM), through a single degrees-of-freedom parameter ν. Varying ν allows systematic exploration of how the tail behavior of the distribution affects separation quality, and the model numerically recovers the prior models at specific values of ν (ν = M for cACGMM). Ito derived a generalized minorization-maximization (MM) algorithm for parameter estimation. On noiseless, reverberated LibriSpeech data, setting ν = 1 improved SDRi over the cACGMM-equivalent setting in all test conditions, with an average gain of up to 0.25 dB.
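For orientation, the component density of one of the unified models, the complex angular central Gaussian (cACG) distribution, can be written down compactly. The sketch below evaluates its standard log-density, log p(z; B) = log (M-1)! - log 2 - M log π - log det B - M log(zᴴ B⁻¹ z), for unit-norm z. It is not the cSTMM density, which this summary does not give; M is assumed here to be the number of channels, and the function name and array layout are illustrative.

```python
import math
import numpy as np

def cacg_log_density(z, B):
    """Log-density of the complex ACG distribution at unit-norm vectors.

    z: complex array of shape (T, M), each row with unit norm.
    B: Hermitian positive-definite parameter matrix of shape (M, M).
    """
    M = z.shape[-1]
    _, logdet = np.linalg.slogdet(B)            # log det B (real for Hermitian PD B)
    Binv_z = np.linalg.solve(B, z.T)            # B^{-1} z for every row, shape (M, T)
    quad = np.real(np.sum(z.conj().T * Binv_z, axis=0))  # z^H B^{-1} z, shape (T,)
    log_norm = math.lgamma(M) - math.log(2.0) - M * math.log(math.pi)
    return log_norm - logdet - M * np.log(quad)

# Toy unit-norm observations (4 vectors, 3 channels).
rng = np.random.default_rng(1)
x = rng.standard_normal((4, 3)) + 1j * rng.standard_normal((4, 3))
z = x / np.linalg.norm(x, axis=-1, keepdims=True)

log_p_uniform = cacg_log_density(z, np.eye(3, dtype=complex))
```

With B equal to the identity the density is uniform on the sphere, and rescaling B by a positive constant leaves it unchanged, which is why the distribution depends only on the direction of z.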

Key Points
  • cSTMM generalizes four prior directional mixture models (including cACGMM, cBMM, and cWMM) using a single ν parameter.
  • The single setting ν = 1 outperformed the cACGMM-equivalent setting (ν = M) in all test conditions, with an average SDRi gain of 0.25 dB.
  • Parameters are estimated with a generalized minorization-maximization (MM) procedure, and the model was evaluated without restarts on reverberated LibriSpeech mixtures.
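To make the SDRi figures concrete, here is a minimal sketch of how an SDR improvement can be computed. It assumes the simplest energy-ratio definition of SDR, 10 log10(‖s‖² / ‖s − ŝ‖²); the paper may use a different variant (for example a BSS Eval style metric), and the function names and toy signals are hypothetical.

```python
import numpy as np

def sdr_db(reference, estimate):
    """Signal-to-distortion ratio in dB under a simple energy-ratio definition."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    signal_energy = np.sum(reference ** 2)
    error_energy = np.sum((reference - estimate) ** 2)
    return 10.0 * np.log10(signal_energy / error_energy)

def sdr_improvement_db(reference, mixture, estimate):
    """SDRi: SDR of the separated estimate minus SDR of the unprocessed mixture."""
    return sdr_db(reference, estimate) - sdr_db(reference, mixture)

# Toy signals: the mixture is far from the reference, the estimate is close.
reference = np.ones(4)
mixture = reference + 1.0     # SDR of 0 dB against the reference
estimate = reference + 0.1    # SDR of 20 dB (up to rounding)
sdri = sdr_improvement_db(reference, mixture, estimate)
```

On this scale, a 0.25 dB average gain is a modest but consistent difference between two model settings.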

Why It Matters

A unified, simpler model that improves speech separation accuracy could benefit hearing aids, conferencing systems, and voice assistants.