Sebastian Braun's mean flow model restores speech in real time with 120x less compute
New speech restoration model runs in real time with 120x less compute than the state of the art
Sebastian Braun introduces a real-time speech restoration model based on Data Prediction Mean Flows, a variant of flow matching generative models. Large offline models excel at tasks like bandwidth extension, gap filling, and removing non-linear artifacts from codecs, clipping, and distortion, but their high latency and compute requirements rule out real-time use. Braun's model addresses this by combining a few-step flow matching approach with a low-latency architecture that adds no algorithmic latency beyond that of the STFT (short-time Fourier transform) processing.
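To make "no algorithmic latency beyond the STFT" concrete, here is a minimal sketch of the usual bookkeeping. It assumes the common convention that an overlap-add STFT system must buffer one full analysis window before it can emit output, and that each frame must be processed within one hop to keep up with the input. The sample rate, window, and hop values are illustrative assumptions, not figures from the paper.

```python
def stft_algorithmic_latency_ms(window_samples: int, sample_rate_hz: int) -> float:
    """Delay from buffering one full analysis window (assumed convention)."""
    return 1000.0 * window_samples / sample_rate_hz


def frame_compute_budget_ms(hop_samples: int, sample_rate_hz: int) -> float:
    """Time available to process one frame before the next hop arrives."""
    return 1000.0 * hop_samples / sample_rate_hz


# Illustrative values only (assumptions, not taken from the paper).
SAMPLE_RATE_HZ = 16_000
WINDOW_SAMPLES = 320   # 20 ms window
HOP_SAMPLES = 160      # 10 ms hop

latency_ms = stft_algorithmic_latency_ms(WINDOW_SAMPLES, SAMPLE_RATE_HZ)
budget_ms = frame_compute_budget_ms(HOP_SAMPLES, SAMPLE_RATE_HZ)
print(f"algorithmic latency: {latency_ms:.1f} ms, per-frame budget: {budget_ms:.1f} ms")
```

Under these assumed settings, a model that needs no look-ahead frames keeps the total algorithmic delay at the window length, and the per-frame budget shows why a few-step sampler matters: every sampling step has to fit inside one hop.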
The key innovation is the mean flow formulation, which cuts computational cost by a factor of 120 compared to state-of-the-art methods while maintaining comparable audio quality. This makes the model practical for real-time deployment in communication systems, hearing aids, and voice assistants. It can restore speech degraded by various non-linear distortions without needing linear denoising or dereverberation, filling a gap in real-time audio processing. The paper is available on arXiv (2605.16251) and could benefit live speech enhancement in telephony and streaming.
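The compute saving of few-step sampling can be illustrated with a toy example. The sketch below follows the general mean flow idea, where a model predicts the average velocity `u(z, r, t)` over an interval and a sample is moved from time `t` to time `r` in one jump via `z_r = z_t - (t - r) * u`. It is not the paper's model: `oracle_average_velocity` is a hypothetical stand-in for a trained network, valid only for a single known target on the linear path `z_t = (1 - t) * target + t * noise`, and the data-prediction parameterization used in the paper is not reproduced here.

```python
import numpy as np


def oracle_average_velocity(z, r, t, target):
    """Hypothetical stand-in for a trained network.

    On the path z_t = (1 - t) * target + t * noise the velocity is the
    constant (noise - target), which equals (z_t - target) / t, so the
    average velocity over any interval [r, t] has the same value.
    """
    return (z - target) / t


def few_step_sample(noise, target, num_steps):
    """Jump from t = 1 (noise) to t = 0 (data) in num_steps updates."""
    times = np.linspace(1.0, 0.0, num_steps + 1)
    z = noise.copy()
    for t, r in zip(times[:-1], times[1:]):
        u = oracle_average_velocity(z, r, t, target)
        z = z - (t - r) * u  # mean flow update: z_r = z_t - (t - r) * u
    return z


rng = np.random.default_rng(0)
target = np.sin(np.linspace(0.0, 2.0 * np.pi, 64))  # toy "clean" signal
noise = rng.standard_normal(64)

one_step = few_step_sample(noise, target, num_steps=1)
four_step = few_step_sample(noise, target, num_steps=4)
print(np.allclose(one_step, target), np.allclose(four_step, target))
```

With an exact average velocity, one update already lands on the target, so extra steps buy nothing in this toy case. A learned model is only approximate, which is why a small number of steps rather than exactly one is typically a reasonable trade-off between quality and network evaluations.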
- Uses Data Prediction Mean Flows, a variant of flow matching, for speech restoration tasks like bandwidth extension and artifact removal
- Requires 120x less compute than state-of-the-art methods at comparable audio quality
- Adds no algorithmic latency beyond the STFT, enabling real-time processing in low-latency systems
Why It Matters
Enables high-quality real-time speech restoration for live calls and streaming with drastically lower compute requirements.