Audio & Speech

Predictive-Generative Drift Decomposition for Speech Enhancement and Separation

A plug-and-play hybrid that trains only on clean speech for universal enhancement

Deep Dive

Current speech enhancement and separation methods often fall into two camps: predictive models (fast, degradation-specific) and generative models (high perceptual quality but slow and task-bound). Hybrid approaches typically require custom architectures and are tied to particular predictors or noise types, limiting reuse. To address this, Julius Richter and colleagues from Mitsubishi Electric Research Labs (MERL) and other institutions introduce SIPS (Stochastic Interpolant Prior for Speech), a plug-and-play framework that seamlessly integrates any pretrained predictive model into a generative sampling process. The key innovation is decomposing the interpolation dynamics into a task-specific drift from the predictor and a stochastic denoising component from a generative score model. This allows the predictor to steer the output toward a task-consistent estimate while the score model preserves perceptual naturalness.
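The decomposition described above can be illustrated with a toy sampler. The sketch below is not the authors' implementation: `moving_average_predictor`, `score_model`, the Gaussian residual prior, and all parameter values are placeholder assumptions standing in for a pretrained predictor (such as SEMamba) and a clean-speech score network.

```python
import numpy as np

def moving_average_predictor(y):
    # Hypothetical stand-in for a pretrained predictive model;
    # any black-box function mapping degraded -> estimated clean speech works.
    kernel = np.ones(5) / 5.0
    return np.convolve(y, kernel, mode="same")

def score_model(residual, sigma):
    # Hypothetical score of a clean-speech prior. Assuming a zero-mean
    # Gaussian prior on the residual, the score is -residual / sigma^2.
    return -residual / sigma**2

def sips_sample(y, predictor, n_steps=50, sigma=0.5, seed=0):
    """Toy Euler-Maruyama sampler illustrating the drift decomposition:
    a task-specific drift toward the predictor's estimate plus a
    stochastic denoising term from the score model."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(y.shape)   # initialise from noise
    x_hat = predictor(y)               # task-consistent estimate, computed once
    dt = 1.0 / n_steps
    for _ in range(n_steps):
        task_drift = x_hat - x                                     # predictor-driven drift
        denoise_drift = sigma**2 * score_model(x - x_hat, sigma)   # generative denoising term
        noise = sigma * np.sqrt(dt) * rng.standard_normal(y.shape)
        x = x + (task_drift + denoise_drift) * dt + noise
    return x
```

Because the sampler only calls `predictor` as a black box, swapping in a different pretrained model means passing a different function, with no retraining of the score model, which is the essence of the plug-and-play claim.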

SIPS trains its score model exclusively on clean speech, making it degradation-agnostic and reusable across diverse tasks such as denoising, dereverberation, and speaker separation. At inference time, the predictor supplies a deterministic drift that guides the generative sampler, with no retraining or fine-tuning required. Experiments with the recent predictors SEMamba and FlexIO show consistent improvements, including gains of up to +1.0 in NISQA, a standard perceptual quality metric, for speech separation. The framework generalizes across additive degradation tasks and offers a theoretically grounded alternative to ad hoc hybrid designs. Submitted to NeurIPS 2026, SIPS promises to democratize high-quality speech processing by letting practitioners plug their preferred predictor into a generative enhancement pipeline with minimal effort.

Key Points
  • SIPS builds on stochastic interpolants to decompose the sampling process into a predictor-driven drift and a generative denoising component
  • Trained solely on clean speech, the score model serves as a degradation-agnostic prior that works with any pretrained predictor
  • Achieves up to +1.0 NISQA improvement in perceptual quality for speech separation using SEMamba and FlexIO predictors

Why It Matters

Enables high-quality speech enhancement without task-specific retraining, critical for real-time communication and hearing aids.