Audio & Speech

Speech Enhancement Based on Drifting Models

A new framework denoises audio in a single step, outperforming multi-step diffusion baselines.

Deep Dive

Researchers from Victoria University of Wellington, Aalborg University, and GN Audio have introduced DriftSE (Speech Enhancement based on Drifting Models), a generative framework that recasts audio denoising as an equilibrium problem. Unlike traditional diffusion models, which require iterative sampling steps, DriftSE achieves one-step inference by evolving the pushforward distribution of a mapping function until it matches the clean speech distribution. This evolution is driven by a Drifting Field, a learned correction vector that guides samples toward high-density regions of the clean distribution. Because the framework matches distributions rather than paired samples, it naturally supports training on unpaired data, a practical advantage when paired clean and noisy recordings are unavailable.
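To make the equilibrium idea concrete, here is a minimal toy sketch in Python. It is not the authors' implementation. It assumes a kernel mean-shift form for the drifting field (pull toward clean samples minus pull toward the model's own outputs), computes that field from samples instead of learning it, and uses scalar numbers as stand-ins for audio. The names `mean_shift`, `drifting_field`, and `train_affine_map`, the affine map, and all parameter values are hypothetical choices made for illustration.

```python
import math
import random


def mean_shift(x, points, bandwidth):
    """Gaussian-kernel-weighted average of (p - x) over `points`."""
    logits = [-((p - x) ** 2) / (2.0 * bandwidth ** 2) for p in points]
    top = max(logits)  # subtract the max so exp() cannot underflow to all zeros
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return sum(w * (p - x) for w, p in zip(weights, points)) / total


def drifting_field(x, clean, generated, bandwidth=1.0):
    """Assumed form: pull toward clean samples minus pull toward generated ones."""
    return mean_shift(x, clean, bandwidth) - mean_shift(x, generated, bandwidth)


def train_affine_map(noisy, clean, steps=50, lr=0.05, bandwidth=1.0):
    """Fit the toy map f(y) = a*y + b by nudging each output along the field."""
    a, b = 1.0, 0.0
    n = len(noisy)
    for _ in range(steps):
        outputs = [a * y + b for y in noisy]
        drift = [drifting_field(x, clean, outputs, bandwidth) for x in outputs]
        # Gradient step on 0.5 * (f(y) - target)^2, target = f(y) + drift held fixed.
        a += lr * sum(v * y for v, y in zip(drift, noisy)) / n
        b += lr * sum(drift) / n
    return a, b


def mean(values):
    return sum(values) / len(values)


random.seed(0)
clean = [random.gauss(2.0, 0.5) for _ in range(100)]
# Unpaired data: the noisy set is built from fresh draws, not from `clean` itself.
noisy = [random.gauss(2.0, 0.5) + random.gauss(1.0, 1.0) for _ in range(100)]

a, b = train_affine_map(noisy, clean)
enhanced = [a * y + b for y in noisy]  # one-step inference: one function evaluation
print("mean gap before:", abs(mean(noisy) - mean(clean)))
print("mean gap after: ", abs(mean(enhanced) - mean(clean)))
```

In this toy, the field is exactly zero when the generated samples coincide with the clean ones, which is the equilibrium the training loop moves toward. All iteration happens during training, so enhancing a new input afterwards costs a single evaluation of the map.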

DriftSE is explored in two formulations: a direct mapping from the noisy observation, and a stochastic conditional generative model that starts from a Gaussian prior. In experiments on the VoiceBank-DEMAND benchmark, DriftSE achieves high-fidelity enhancement in a single step and outperforms multi-step diffusion baselines. The paper, submitted to arXiv on April 27, 2026, runs 6 pages with 2 figures and is categorized under Sound, AI, Audio and Speech Processing, and Signal Processing. The work positions drifting models as a faster, more efficient alternative to current diffusion-based speech enhancement methods.

Key Points
  • DriftSE formulates denoising as an equilibrium problem, enabling one-step inference instead of iterative sampling.
  • The Drifting Field is a learned correction vector that guides samples toward the clean speech distribution, supporting unpaired training.
  • On VoiceBank-DEMAND, DriftSE outperforms multi-step diffusion baselines, achieving high-fidelity enhancement in a single step.

Why It Matters

DriftSE offers a faster, more efficient speech denoising method, potentially enabling real-time audio enhancement in consumer devices.