DiT-Flow: Speech Enhancement Robust to Multiple Distortions based on Flow Matching in Latent Space and Diffusion Transformers
A new speech enhancement model pairs a Diffusion Transformer backbone with LoRA-based experts to handle multiple simultaneous distortions with minimal compute.
A research team from Johns Hopkins University and other institutions has introduced DiT-Flow, a novel speech enhancement (SE) framework designed to tackle the real-world problem of cleaning up audio corrupted by multiple simultaneous distortions. The model is built on a generative flow matching approach, which operates on compact latent features derived from a variational autoencoder (VAE), and uses a Diffusion Transformer (DiT) as its core backbone. This architecture was trained and validated on the StillSonicSet, a large, synthetic yet acoustically realistic dataset built from LibriSpeech, FSD50K, and simulated room acoustics from Matterport3D scenes. The goal was to create a model robust to a wide mix of noise, reverberation, and compression artifacts, moving beyond the narrow conditions typical of most SE research.
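For readers who want a concrete picture of the core recipe, the sketch below shows conditional flow matching over VAE latents with a small Transformer predicting the velocity field, standing in for the DiT backbone. All module sizes, the concatenation-based conditioning, and the toy Transformer are illustrative assumptions for this article, not the paper's actual implementation.

```python
# Minimal sketch of conditional flow matching on VAE latent sequences.
# A tiny Transformer stands in for the DiT; all names/shapes are assumptions.
import torch
import torch.nn as nn

class TinyDiT(nn.Module):
    """Toy stand-in for a Diffusion Transformer operating on latent frames."""
    def __init__(self, dim=64, depth=2, heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.time_mlp = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.in_proj = nn.Linear(2 * dim, dim)   # concat(noisy path, degraded condition)
        self.out_proj = nn.Linear(dim, dim)      # predicted velocity field

    def forward(self, x_t, cond, t):
        h = self.in_proj(torch.cat([x_t, cond], dim=-1))
        h = h + self.time_mlp(t.view(-1, 1)).unsqueeze(1)   # broadcast time embedding
        return self.out_proj(self.encoder(h))

def flow_matching_loss(model, z_clean, z_degraded):
    """Standard conditional flow matching objective on latents of shape
    (batch, frames, dim): regress the velocity (z_clean - noise) along the
    straight path from Gaussian noise to the clean latent."""
    noise = torch.randn_like(z_clean)
    t = torch.rand(z_clean.size(0), device=z_clean.device)
    x_t = (1 - t.view(-1, 1, 1)) * noise + t.view(-1, 1, 1) * z_clean
    target_velocity = z_clean - noise
    pred_velocity = model(x_t, z_degraded, t)
    return ((pred_velocity - target_velocity) ** 2).mean()

# Usage with random tensors in place of real VAE latents of clean/degraded speech.
model = TinyDiT()
z_clean = torch.randn(8, 100, 64)     # (batch, latent frames, latent dim)
z_degraded = torch.randn(8, 100, 64)
loss = flow_matching_loss(model, z_clean, z_degraded)
loss.backward()
```

At inference, the model would integrate the learned velocity field from noise toward a clean latent, conditioned on the degraded latent, and decode the result with the VAE.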
The key innovation for efficiency is DiT-Flow's integration of Low-Rank Adaptation (LoRA) within a Mixture of Experts (MoE) framework. This hybrid approach allows the model to achieve high performance while activating only a small, task-specific subset of its parameters for any given input. The paper reports that the method activates just 4.9% of the total parameters while obtaining superior results on five completely unseen distortion types, demonstrating strong generalization. This parameter efficiency addresses a major bottleneck in deploying robust AI audio models, making high-quality speech enhancement more feasible for real-time applications on devices with limited computational resources.
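As a rough illustration of how LoRA and MoE can be combined, the sketch below wraps a frozen linear layer with a router that picks low-rank expert updates per utterance. The expert count, rank, top-1 routing, and placement inside the network are assumptions made for this example; the paper's exact routing and adapter placement may differ.

```python
# Hedged sketch of a LoRA-style Mixture-of-Experts adapter around a frozen
# linear layer. Hyperparameters and routing granularity are illustrative only.
import torch
import torch.nn as nn

class LoRAMoELinear(nn.Module):
    def __init__(self, in_dim, out_dim, num_experts=4, rank=8, top_k=1):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim)
        self.base.weight.requires_grad_(False)   # frozen pretrained weight
        self.base.bias.requires_grad_(False)
        self.router = nn.Linear(in_dim, num_experts)
        # Each expert e is a low-rank update: delta = x @ A_e^T @ B_e^T
        self.lora_A = nn.Parameter(torch.randn(num_experts, rank, in_dim) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(num_experts, out_dim, rank))
        self.top_k = top_k

    def forward(self, x):                                  # x: (batch, frames, in_dim)
        gate_logits = self.router(x.mean(dim=1))           # one routing decision per utterance
        topk_vals, topk_idx = gate_logits.topk(self.top_k, dim=-1)
        gates = torch.zeros_like(gate_logits).scatter(-1, topk_idx, topk_vals.softmax(-1))
        # Dense computation of all expert deltas for clarity; a real sparse
        # implementation would only evaluate the selected experts.
        expert_out = torch.einsum('btd,erd,eor->ebto', x, self.lora_A, self.lora_B)
        delta = torch.einsum('be,ebto->bto', gates, expert_out)
        return self.base(x) + delta

layer = LoRAMoELinear(64, 64)
y = layer(torch.randn(8, 100, 64))
```

Because the base weights stay frozen and only the router plus the routed experts' small low-rank matrices are active for a given input, the active parameter fraction stays tiny; this is the kind of accounting behind a figure like the reported 4.9%.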
Experiments show that DiT-Flow consistently outperforms current state-of-the-art generative speech enhancement models. The success highlights the effectiveness of combining flow matching for generative refinement with the scalable architecture of Diffusion Transformers, all while maintaining efficiency through modern parameter-tuning techniques. This work represents a significant step toward AI audio systems that can perform reliably in the messy, unpredictable acoustic environments of the real world.
- Built on a flow matching framework using a latent Diffusion Transformer (DiT) backbone for robust speech enhancement.
- Uses a LoRA + Mixture of Experts (MoE) setup for efficiency, activating only 4.9% of parameters for high performance on unseen distortions.
- Trained and validated on the large, synthetic StillSonicSet to handle multiple real-world distortions like noise, reverb, and compression simultaneously.
Why It Matters
Enables clearer voice calls and audio in noisy real-world environments at a fraction of the computational cost of traditional models.