StuPASE: Towards Low-Hallucination Studio-Quality Generative Speech Enhancement
New model outperforms state-of-the-art methods by combining flow-matching with dry target training.
A research team including Xiaobin Rong, Jun Gao, and four other authors has introduced StuPASE, a generative speech enhancement model that targets the persistent trade-off between audio quality and hallucination, that is, generated speech content that was not present in the input. Building on PASE, a framework that is robust but limited in output quality, StuPASE achieves what the researchers term "studio-level" perceptual quality while retaining the low-hallucination properties essential for reliable speech processing. The stated goal is AI-powered audio enhancement that is both trustworthy and high-fidelity.
The improvement comes from two changes, one to the training targets and one to the architecture. First, the team found that fine-tuning the model with completely dry audio targets, rather than targets containing simulated early reflections, substantially improves dereverberation performance. Second, to overcome limitations under extreme additive noise, they replaced PASE's GAN-based generative module with a flow-matching module, enabling the system to generate clean, studio-quality speech even in highly adverse acoustic environments. According to the reported experiments, StuPASE consistently outperforms state-of-the-art speech enhancement methods across multiple challenging scenarios.
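For readers unfamiliar with flow matching, the following is a minimal sketch of the general technique, not the StuPASE implementation. It assumes the common straight-path (linear interpolation) formulation, which the summary does not confirm the authors use, and it leaves out any conditioning on the noisy input. The function names and the `predict_velocity` callable, which stands in for a trained network, are hypothetical.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to clean target x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight path, constant in t: x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(predict_velocity, x0, x1, t):
    """Mean squared error between predicted and target velocity at (x_t, t)."""
    x_t = interpolate(x0, x1, t)
    v_pred = predict_velocity(x_t, t)
    v_true = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(v_true)


def euler_sample(predict_velocity, x0, num_steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = predict_velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy check with an "oracle" velocity field that already knows the clean target.
# A real system would use a neural network trained by minimizing the loss above.
clean = [0.5, -0.25, 1.0]   # assumed stand-in for clean speech features
noise = [1.5, 0.75, -2.0]   # assumed stand-in for a Gaussian noise draw


def oracle(x, t):
    # Conditional velocity that points from x toward the known clean target.
    return [(c - xi) / (1.0 - t) for c, xi in zip(clean, x)]


generated = euler_sample(oracle, noise, num_steps=8)
```

Unlike a GAN, which is trained against a discriminator, this objective is a plain regression onto a velocity target, and generation is the numerical solution of an ordinary differential equation from noise to data.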
This research, submitted to Interspeech 2026, provides both technical details and audio demonstrations of the model's capabilities. The work addresses a need in applications ranging from voice communication platforms and hearing aids to audio forensics and content creation, where both clarity and accuracy are paramount. By keeping hallucination low while raising audio quality, StuPASE aims to set a new benchmark for generative speech enhancement.
- Replaces GAN-based generation with flow-matching for superior performance under strong noise conditions
- Fine-tuned with dry audio targets instead of targets containing simulated early reflections, improving dereverberation significantly
- Outperforms current state-of-the-art speech enhancement methods while maintaining low hallucination rates
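The difference between the two kinds of training target mentioned above can be illustrated with a toy example. This is a sketch under stated assumptions, not the authors' data pipeline: it assumes the reverberant input is simulated by convolving speech with a room impulse response (RIR), that the direct path is the strongest RIR tap, and that "early" means a fixed window after the direct path (the 50 ms default is a hypothetical value). All function names are illustrative.

```python
def convolve(signal, kernel):
    """Full linear convolution of two lists of floats."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


def direct_path_index(rir):
    """Index of the strongest tap, taken here as the direct-path arrival."""
    return max(range(len(rir)), key=lambda i: abs(rir[i]))


def early_reflection_target(speech, rir, sample_rate, early_ms=50.0):
    """Target keeping the direct path plus reflections within early_ms of it."""
    start = direct_path_index(rir)
    keep = int(sample_rate * early_ms / 1000.0)
    early_rir = rir[start:start + keep + 1]
    return convolve(speech, early_rir)[:len(speech)]


def dry_target(speech, rir):
    """Target keeping only the direct path: a scaled copy of the speech."""
    gain = rir[direct_path_index(rir)]
    return [gain * s for s in speech]


# Toy signals: a single impulse as "speech" and a four-tap RIR at 1 kHz.
speech = [1.0, 0.0, 0.0, 0.0]
rir = [0.0, 1.0, 0.5, 0.25]

with_early = early_reflection_target(speech, rir, sample_rate=1000, early_ms=1.0)
dry = dry_target(speech, rir)
```

In this toy case the early-reflection target still carries one echo tap after the impulse, while the dry target carries none, so a model trained on the dry target is asked to remove all reverberation rather than only the late part.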
Why It Matters
Enables reliable, high-quality audio enhancement for calls, content creation, and assistive tech without AI adding false content.