Audio & Speech

SenSE: Semantic-Aware High-Fidelity Universal Speech Enhancement

A new two-stage framework uses a language model to guide speech enhancement, preserving context and meaning.

Deep Dive

A research team from Northwestern Polytechnical University and Xiamen University has introduced SenSE, a novel framework for Universal Speech Enhancement (USE) that prioritizes semantic fidelity. Unlike previous generative models, which often produce acoustically clean but semantically inconsistent audio, SenSE employs a two-stage architecture. First, a language model analyzes the degraded speech to extract semantic priors—understanding the intended words and context. Second, a flow-matching-based generative model uses these priors to guide the enhancement process, ensuring the cleaned speech remains true to the original meaning. This design directly addresses a key weakness in existing AI audio cleanup tools.
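To make the two-stage idea concrete, here is a minimal sketch of the pipeline's control flow. The function names and data shapes are hypothetical stand-ins, not the paper's actual models: stage one would be a speech language model producing discrete semantic tokens, and stage two a flow matching generator conditioned on those tokens.

```python
def extract_semantic_priors(degraded_audio):
    """Stage 1 (stand-in): a language model maps degraded speech to
    discrete semantic tokens capturing the intended words and context."""
    # Placeholder: pretend each audio frame yields one token id.
    return [hash(frame) % 1000 for frame in degraded_audio]

def flow_matching_enhance(degraded_audio, semantic_tokens):
    """Stage 2 (stand-in): a flow matching model generates clean speech
    conditioned on both the noisy input and the semantic tokens."""
    # Placeholder: the real model denoises; here we only enforce the
    # guided-generation contract that conditioning is aligned with input.
    assert len(degraded_audio) == len(semantic_tokens)
    return list(degraded_audio)

def enhance(degraded_audio):
    """End-to-end: semantic priors first, then guided generation."""
    tokens = extract_semantic_priors(degraded_audio)
    return flow_matching_enhance(degraded_audio, tokens)
```

The point of the sketch is the dependency order: the generator never runs without the semantic priors, which is what keeps the output tied to the original meaning.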

The system's innovation extends to its training strategy, called dual-path masked conditioning. This allows the model to flexibly integrate multiple conditioning signals: the noisy audio input, the extracted semantic tokens, and optional high-quality reference speech. This multi-source approach grants SenSE remarkable adaptability to various types of audio distortion, from background noise to reverberation. Experimental results confirm it achieves state-of-the-art performance among generative speech enhancement models, showing a particularly high "performance ceiling" when dealing with severely degraded recordings. The work has been accepted for presentation at the ICME 2026 conference, and the researchers have made the code and audio demos publicly available for testing.
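The dual-path masked conditioning strategy can be illustrated with a small sketch. The masking probabilities and dictionary layout below are illustrative assumptions, not values from the paper; the idea shown is only that, during training, optional conditioning paths are randomly dropped so the model learns to enhance from any subset of signals.

```python
import random

def sample_conditioning(noisy_feats, semantic_tokens, reference_speech,
                        p_drop_semantic=0.3, p_drop_reference=0.5):
    """Sketch of masked conditioning: each optional path is independently
    masked out with some probability (values here are illustrative)."""
    cond = {"noisy": noisy_feats}  # the degraded input is always kept
    cond["semantic"] = (None if random.random() < p_drop_semantic
                        else semantic_tokens)
    cond["reference"] = (None if random.random() < p_drop_reference
                         else reference_speech)
    return cond
```

Training on such randomly masked subsets is what would let a single model handle inference both with and without a high-quality reference utterance.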

Key Points
  • Uses a language model in a two-stage process to guide audio generation with semantic understanding, preserving context.
  • Employs a dual-path masked conditioning training strategy to integrate signals from noisy audio, semantic tokens, and reference speech.
  • Demonstrates state-of-the-art results, especially under challenging distortion conditions, with code and demos released for public use.

Why It Matters

This enables more reliable AI audio cleanup for calls, recordings, and assistive tech, where understanding the message is as crucial as clarity.