LL-SDR: Low-Latency Speech enhancement through Discrete Representations
A new framework uses discrete audio tokens to separate speech from noise with minimal processing delay.
A research team including Jingyi Li, Luca Della Libera, Mirco Ravanelli, and Cem Subakan has published a paper on arXiv introducing LL-SDR (Low-Latency Speech enhancement through Discrete Representations). The framework tackles a core question in audio AI: whether converting continuous audio signals into discrete tokens can consistently improve speech enhancement, the task of cleaning up noisy speech. Their approach uses the discretization step itself to separate the target speech from background noise.
LL-SDR rests on two key innovations. The first is the Variance-Ordered Residual Vector Quantizer (VO-RVQ), a component designed to disentangle the underlying distributions of speech and noise during the initial tokenization step, creating a more structured latent space. The second is a latent-space discriminator that helps align the 'cleaned' audio embeddings with high-quality semantic embeddings, which is intended to keep the enhanced output natural-sounding.
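For readers unfamiliar with residual vector quantization, the general idea behind the tokenization step can be sketched as follows. This is a minimal illustration of plain RVQ only, not the authors' VO-RVQ: the variance ordering, the training procedure, and every name here (`rvq_encode`, `rvq_decode`, the random codebooks and their sizes) are assumptions made for the example.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Plain residual vector quantization (illustrative, not VO-RVQ).

    x: array of shape (n_frames, dim).
    codebooks: list of arrays, each of shape (codebook_size, dim).
    Returns token indices, the quantized frames, and the final residual.
    """
    residual = x.copy()
    quantized = np.zeros_like(x)
    indices = []
    for cb in codebooks:
        # Squared distance from every residual frame to every code vector.
        dist = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(axis=-1)
        idx = dist.argmin(axis=1)
        chosen = cb[idx]
        quantized += chosen   # accumulate this stage's contribution
        residual -= chosen    # the next stage quantizes what is left over
        indices.append(idx)
    return np.stack(indices, axis=1), quantized, residual

def rvq_decode(tokens, codebooks):
    """Rebuild quantized frames by summing the selected code vectors."""
    out = np.zeros((tokens.shape[0], codebooks[0].shape[1]))
    for stage, cb in enumerate(codebooks):
        out += cb[tokens[:, stage]]
    return out

# Hypothetical data: 4 latent frames of dimension 8, 3 stages of 16 codes.
rng = np.random.default_rng(0)
frames = rng.normal(size=(4, 8))
codebooks = [rng.normal(scale=1.0 / (s + 1), size=(16, 8)) for s in range(3)]

tokens, quantized, residual = rvq_encode(frames, codebooks)
reconstructed = rvq_decode(tokens, codebooks)
```

Each frame ends up as one token per stage, and later stages encode what earlier stages missed. The article describes VO-RVQ as structuring this process so that speech and noise are disentangled; how it does so is not reproduced in this sketch.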
The reported results show LL-SDR outperforming traditional continuous-representation baselines and performing on par with more computationally intensive autoregressive token-based models. Crucially, it does this while remaining lightweight and enabling low-latency processing, which makes it suitable for real-time applications such as voice calls, hearing aids, or meeting transcription in both reverberant (echo-filled) and standard noisy environments. The team has made demos and the source code publicly available.
- Uses a novel Variance-Ordered Residual Vector Quantizer (VO-RVQ) to disentangle speech and noise distributions during tokenization.
- Incorporates a latent-space discriminator to better align enhanced audio embeddings with clean semantic embeddings.
- Matches the performance of heavier autoregressive models while enabling lightweight, low-latency processing for real-time use.
Why It Matters
Could enable clearer real-time communication in voice calls, conferencing, and assistive listening devices with minimal delay.