Audio & Speech

DiffVQE: First diffusion model beats Microsoft's DeepVQE at echo cancellation

Outperforms discriminative models on echo and noise while being smaller and faster.

Deep Dive

In a new preprint submitted to Interspeech 2026, academic researchers (including Pejman Mowlaee and Tim Fingscheidt) present DiffVQE, the first diffusion-based model for joint acoustic echo cancellation (AEC) and background noise suppression in hands-free systems and speakerphones. The prior state of the art was dominated by discriminative end-to-end models, particularly Microsoft's DeepVQE, which excelled in the ICASSP 2023 AEC Challenge. DiffVQE uses a hybrid diffusion framework trained on the diverse, high-quality dataset from the Interspeech 2025 URGENT Challenge, and it outperforms DeepVQE on echo reduction, noise suppression, and computational efficiency.

The model is fully reproducible: the authors provide topology, training data, and framework details, a key differentiator in generative speech enhancement research. DiffVQE is non-causal, so it is not built under a real-time constraint, but its performance gains suggest that generative methods can now compete with or beat discriminative approaches in voice quality enhancement. The authors demonstrate that diffusion-based methods not only achieve superior echo and noise control but also require fewer parameters and less computation, which makes them candidates for future deployment in conference systems and voice assistants.

Key Points
  • First diffusion-based AEC model that is fully reproducible (topology, data, training framework).
  • Outperforms Microsoft's DeepVQE in echo/noise control while being computationally lighter and smaller.
  • Trained on the Interspeech 2025 URGENT Challenge dataset for diverse, high-quality speech enhancement.

Why It Matters

Diffusion models can now beat top discriminative models at echo cancellation, which could lead to clearer calls and voice commands.