Audio & Speech

Absorbing Discrete Diffusion for Speech Enhancement

New AI model removes background noise from speech in only a few sampling steps.

Deep Dive

Researcher Philippe Gonzalez has introduced ADDSE (Absorbing Discrete Diffusion for Speech Enhancement), a new approach to speech enhancement detailed in a paper submitted to Interspeech 2026. The method removes background noise from audio recordings by modeling the conditional distribution of clean speech codes given noisy inputs with absorbing discrete diffusion, in which tokens are progressively replaced by a special mask state and a network learns to restore them. The approach combines two ingredients: the expressive latent space of modern neural audio codecs (like SoundStream or EnCodec) and the non-autoregressive sampling efficiency of diffusion models. The result is a system that can separate speech from noise in fewer sampling steps than autoregressive models, which generate one token at a time.
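To make the idea of absorbing discrete diffusion more concrete, here is a minimal Python sketch of the generic technique, not the ADDSE implementation. The `MASK` id, the linear masking probability `t`, the random unmasking order, and the `predict` callback are all assumptions made for illustration. In a real system `predict` would be a neural network conditioned on the noisy speech.

```python
import numpy as np

MASK = -1  # hypothetical id for the absorbing "mask" state (assumption)


def absorb(codes, t, rng):
    """Forward process: replace each token by MASK independently with probability t."""
    codes = np.asarray(codes)
    masked = rng.random(codes.shape) < t
    return np.where(masked, MASK, codes)


def unmask_step(x, predict, n_reveal, rng):
    """One reverse step: fill up to n_reveal randomly chosen masked positions."""
    x = x.copy()
    idx = np.flatnonzero(x == MASK)
    if idx.size == 0:
        return x
    chosen = rng.choice(idx, size=min(n_reveal, idx.size), replace=False)
    # predict(x) is a stand-in for a network that proposes a token per position.
    x[chosen] = predict(x)[chosen]
    return x


def sample(length, predict, n_steps, rng):
    """Start from an all-MASK sequence and reveal it in n_steps parallel steps."""
    x = np.full(length, MASK)
    per_step = -(-length // n_steps)  # ceiling division
    for _ in range(n_steps):
        x = unmask_step(x, predict, per_step, rng)
    return x
```

Because each step reveals several tokens at once, the number of steps can be much smaller than the sequence length, which is the property the article refers to as "few sampling steps". Calling `sample(10, predict, n_steps=4, rng=np.random.default_rng(0))` fills a 10-token sequence in four passes.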

The technical innovation centers on RQDiT, a new architecture that merges techniques from RQ-Transformer and diffusion Transformers to handle the hierarchical structure of residual vector quantization (RVQ) codes non-autoregressively. This lets the model process the multiple layers of the compressed speech representation efficiently. Benchmarks show ADDSE delivers competitive performance on standard non-intrusive objective metrics across two datasets, with particular strength in challenging low signal-to-noise ratio (SNR) environments. The "few sampling steps" advantage suggests potential for real-time or low-latency applications, such as communication tools or hearing aids. Code and audio demos are available online, so the research community can test and build on the work right away.
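The hierarchical structure of RVQ codes mentioned above can be illustrated with a small Python sketch. It shows generic residual vector quantization only. The tiny hand-written codebooks are assumptions for illustration and have nothing to do with the codec or the RQDiT architecture used in the paper.

```python
import numpy as np


def rvq_encode(x, codebooks):
    """Quantize x layer by layer; each layer codes the residual left by the previous one."""
    residual = np.asarray(x, dtype=float)
    codes = []
    for cb in codebooks:
        # Pick the codebook entry closest to the current residual.
        idx = int(np.argmin(np.sum((cb - residual) ** 2, axis=1)))
        codes.append(idx)
        residual = residual - cb[idx]
    return codes


def rvq_decode(codes, codebooks):
    """Reconstruct the vector by summing the selected entry from every layer."""
    return sum(cb[i] for cb, i in zip(codebooks, codes))


# Toy codebooks (assumed values): a coarse layer and a finer correction layer.
codebooks = [
    np.array([[0.0, 0.0], [1.0, 1.0]]),
    np.array([[0.0, 0.0], [0.25, -0.25]]),
]
codes = rvq_encode([1.2, 0.8], codebooks)
approx = rvq_decode(codes, codebooks)
```

Each time frame therefore carries one code per layer, with earlier layers holding coarse information and later layers holding refinements. That per-frame stack of codes is the hierarchy a model has to predict when it generates codec tokens.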

Key Points
  • Uses absorbing discrete diffusion to model clean speech from noisy inputs, a novel application of the technique.
  • Introduces RQDiT architecture to handle hierarchical neural codec codes efficiently and non-autoregressively.
  • Shows strong performance at low signal-to-noise ratios and needs only a few sampling steps, which points to faster processing.

Why It Matters

Could bring clearer audio to calls, recordings, and assistive devices in noisy environments at lower computational cost.