Audio & Speech

Lightweight speech enhancement guided target speech extraction in noisy multi-speaker scenarios

New lightweight AI model isolates single voices in crowded, noisy environments with significant quality improvements.

Deep Dive

A research team from multiple institutions has published a new paper that uses GTCRN, a lightweight speech enhancement model, to guide target speech extraction (TSE) in challenging acoustic environments. While current TSE systems handle simple one-speaker-plus-noise or two-speaker mixtures well, they struggle in real-world scenarios with multiple overlapping speakers and background noise. The proposed approach builds on the team's earlier speaker-embedding-free framework, SEF-PNet, and introduces two extensions: LGTSE and D-LGTSE.

LGTSE incorporates what the team calls 'noise-agnostic enrollment guidance.' It works by first denoising the input noisy speech before that speech interacts with an enrollment sample (a short reference clip of the target speaker's voice). This pre-cleaning step reduces noise interference in the critical context-matching phase. D-LGTSE goes further to improve robustness against speech distortion by using the denoised speech as an additional noisy input during model training. This expands the range of noisy conditions the model learns from, allowing it to handle distorted signals more effectively.
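As a rough illustration of that data flow only (this is not the paper's architecture; `denoise`, `interact`, and `extract` below are hypothetical stand-ins for learned modules), the denoise-before-interaction idea can be sketched as:

```python
import numpy as np

def denoise(mixture):
    # Hypothetical stand-in for the lightweight enhancer (GTCRN in the
    # paper): here just a 3-tap moving average as a placeholder.
    kernel = np.ones(3) / 3.0
    return np.convolve(mixture, kernel, mode="same")

def interact(enhanced, enrollment):
    # Speaker-embedding-free interaction: correlate the pre-cleaned
    # mixture with the enrollment clip to get a guidance signal.
    # A real system does this on learned features, not raw waveforms.
    n = min(len(enhanced), len(enrollment))
    return np.correlate(enhanced[:n], enrollment[:n], mode="same")

def extract(mixture, guidance):
    # Hypothetical extractor: modulate the original mixture by the
    # normalized guidance; a real model predicts a mask with a network.
    g = np.abs(guidance) / (np.max(np.abs(guidance)) + 1e-8)
    return mixture[:len(g)] * g

rng = np.random.default_rng(0)
mixture = rng.standard_normal(16000)     # noisy multi-speaker input
enrollment = rng.standard_normal(16000)  # reference clip of target speaker

# LGTSE: denoise FIRST, so the enrollment interaction sees less noise.
enhanced = denoise(mixture)
target_est = extract(mixture, interact(enhanced, enrollment))

# D-LGTSE (training time): treat the denoised signal as an additional
# noisy input, widening the range of conditions the model learns from.
training_inputs = [mixture, enhanced]
```

The key ordering is that enrollment matching operates on the enhanced signal while extraction still works from the original mixture, so the enhancer only has to help the matching step, not produce the final output.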

The team employed a sophisticated two-stage training strategy. First, they pre-trained the model using GTCRN's enhancement guidance, then performed joint fine-tuning to fully optimize performance. The results on the standard Libri2Mix benchmark dataset are compelling: a 0.89 decibel improvement in Scale-Invariant Signal-to-Distortion Ratio (SISDR), a 0.16 point gain in Perceptual Evaluation of Speech Quality (PESQ), and a 1.97% increase in Short-Time Objective Intelligibility (STOI). These metrics translate to clearer, more intelligible extracted speech in crowded, noisy settings where current systems falter.
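SI-SDR, the metric behind the reported 0.89 dB gain, compares an estimate against the clean reference after an optimal rescaling, so it rewards fidelity while ignoring overall gain. A minimal NumPy implementation of the standard definition (not code from the paper):

```python
import numpy as np

def si_sdr(estimate, reference):
    """Scale-Invariant Signal-to-Distortion Ratio in dB."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Project the estimate onto the reference (optimal scaling factor).
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference            # scaled reference component
    distortion = estimate - target        # everything else
    return 10 * np.log10(np.dot(target, target) / np.dot(distortion, distortion))

rng = np.random.default_rng(1)
clean = rng.standard_normal(16000)
noisy = clean + 0.1 * rng.standard_normal(16000)

score = si_sdr(noisy, clean)
# Scaling the estimate leaves the score unchanged (scale invariance),
# so a 0.89 dB gain reflects genuinely less distortion, not louder output.
scaled_score = si_sdr(2.0 * noisy, clean)
```

Higher is better: a perfect extraction drives the distortion term toward zero and the score toward infinity.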

Key Points
  • GTCRN-guided model improves target speech extraction by 0.89 dB SISDR on the Libri2Mix dataset
  • Uses two-stage training with enhancement-guided pre-training then joint fine-tuning
  • D-LGTSE variant increases robustness by training on both original and denoised speech

Why It Matters

Enables clearer voice isolation in real-world settings like crowded meetings, conferences, and public spaces for better audio applications.