Audio & Speech

Robust Pitch Estimation and Tracking for Speakers Based on Subband Encoding and the Generalized Labeled Multi-Bernoulli Filter

A novel audio processing method combines subband encoding with a multi-target tracking filter to isolate speaker pitch.

Deep Dive

Researcher Shoufeng Lin has published a paper introducing a robust method for estimating and tracking the fundamental frequency (pitch) of speakers in challenging audio environments. The core innovation is a two-stage pipeline: it first decomposes the audio with a biologically inspired subband filterbank, then applies a multi-target tracking filter originally developed for radar and sonar systems. Rather than fixing the number of frequency bands in advance, the method introduces a novel 'frequency coverage metric' to configure the filterbank dynamically, tuning it to the sparsity of human speech in the time-frequency domain.
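
The summary does not spell out the filterbank design or the frequency coverage metric, so the following is only a sketch of one common auditory-inspired choice: spacing subband center frequencies uniformly on the ERB-rate scale (Glasberg and Moore's approximation). The function names, the 80 to 4000 Hz range, and the band count are assumptions made for illustration, and the paper's metric for choosing the configuration is not reproduced here.

```python
import math


def hz_to_erb_rate(f_hz):
    """Map frequency in Hz to the ERB-rate scale (Glasberg & Moore approximation)."""
    return 21.4 * math.log10(1.0 + 0.00437 * f_hz)


def erb_rate_to_hz(e):
    """Inverse of hz_to_erb_rate."""
    return (10.0 ** (e / 21.4) - 1.0) / 0.00437


def center_frequencies(f_low, f_high, n_bands):
    """Return n_bands center frequencies, uniformly spaced on the ERB-rate scale.

    Hypothetical helper: the band count is a plain input here, whereas the
    paper reportedly selects the configuration with its coverage metric.
    """
    lo, hi = hz_to_erb_rate(f_low), hz_to_erb_rate(f_high)
    if n_bands == 1:
        return [erb_rate_to_hz(0.5 * (lo + hi))]
    step = (hi - lo) / (n_bands - 1)
    return [erb_rate_to_hz(lo + i * step) for i in range(n_bands)]


if __name__ == "__main__":
    # Assumed speech-oriented range; bands are dense at low frequencies.
    for fc in center_frequencies(80.0, 4000.0, 8):
        print(f"{fc:8.1f} Hz")
```

Because the spacing is uniform on a warped scale, neighbouring bands sit close together at low frequencies, where the lower harmonics of voiced speech fall, and spread out toward higher frequencies.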

For the tracking stage, the system adapts a Generalized Labeled Multi-Bernoulli (GLMB) filter, a Bayesian multi-target tracker designed for cases where the number of objects is unknown and varies over time. Here the objects are pitch 'targets': a novel state transition model based on the Ornstein-Uhlenbeck process smooths the pitch contours, and a measurement-driven model handles new speakers entering the scene. The result is a system that can maintain the identity of individual speakers and their pitch trajectories even when they are intermittently obscured by noise or other talkers.
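
As a rough illustration of how an Ornstein-Uhlenbeck transition keeps a pitch contour smooth, the sketch below uses the standard exact discretization of the OU process. It is not the paper's model: the function names, the parameters `mean_hz`, `theta`, and `sigma`, and every numeric value are assumptions chosen for illustration, and the GLMB filter itself is not implemented.

```python
import math
import random


def ou_step(pitch_hz, mean_hz, theta, sigma, dt, rng):
    """Draw the next pitch value from the exact OU transition density.

    theta (> 0) is the mean-reversion rate, sigma the diffusion strength,
    dt the frame step in seconds. All are illustrative assumptions.
    """
    decay = math.exp(-theta * dt)
    # Conditional mean: the state relaxes toward mean_hz.
    mean = mean_hz + (pitch_hz - mean_hz) * decay
    # Conditional variance of the OU process over one step.
    var = sigma ** 2 / (2.0 * theta) * (1.0 - decay ** 2)
    return mean + math.sqrt(var) * rng.gauss(0.0, 1.0)


def simulate_contour(start_hz, mean_hz, theta, sigma, dt, n_frames, seed=0):
    """Simulate a pitch contour of n_frames values, starting at start_hz."""
    rng = random.Random(seed)
    contour = [start_hz]
    for _ in range(n_frames - 1):
        contour.append(ou_step(contour[-1], mean_hz, theta, sigma, dt, rng))
    return contour


if __name__ == "__main__":
    # Assumed 10 ms frames and a 120 Hz long-run pitch.
    contour = simulate_contour(200.0, 120.0, theta=5.0, sigma=20.0,
                               dt=0.01, n_frames=10)
    print([round(p, 1) for p in contour])
```

In a tracker, the conditional mean and variance computed in `ou_step` would serve as the prediction step for each pitch target; a larger `theta` pulls the contour back toward its long-run value more quickly, while a larger `sigma` allows bigger frame-to-frame jumps.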

Experimental results show the proposed method outperforming several state-of-the-art pitch estimation techniques across various types of additive noise. It also remains robust on real-world recordings made in a reverberant room, a common failure point for audio processing algorithms. The combination of an auditory-inspired front end with Bayesian multi-target tracking is a notable step forward for computational auditory scene analysis.

Key Points
  • Uses a novel subband encoding front-end, dynamically configured via a new 'frequency coverage metric' instead of fixed parameters.
  • Adapts a Generalized Labeled Multi-Bernoulli (GLMB) filter, a multi-target tracker from radar and sonar, to track pitch 'targets' and maintain speaker identity.
  • Demonstrated superior accuracy over existing methods in tests with additive noise and real recordings in reverberant rooms.

Why It Matters

More reliable pitch tracking could improve voice AI, speaker diarization, and hearing aids in noisy, real-world environments such as crowded rooms.