L3-SE slashes linguistic hallucinations in speech enhancement
New distillation method cuts speech AI errors by 50% in noisy environments
Language model-based speech enhancement produces natural-sounding speech but often hallucinates linguistically incorrect outputs under severe noise. To address this, researchers introduce L3-SE (Language-model Speech Enhancement with Semantic Embeddings), a framework that learns noise-invariant acoustic-semantic representations. It distills two clean-speech targets: an acoustic target for reconstruction fidelity and a semantic target for linguistic consistency, then conditions a decoder-only autoregressive LM on these representations. A high-fidelity codec built on learnable weighted WavLM layer representations serves as the discrete acoustic interface, enabling high-quality generation.
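To make the "learnable weighted WavLM layer representations" idea concrete, here is a minimal sketch of a weighted layer sum. It is not the authors' implementation: the summary does not say how the weights are normalized, so the softmax is an assumption, the function name `weighted_layer_sum` is hypothetical, and random arrays stand in for real WavLM hidden states. Only the forward pass is shown; in training, the per-layer logits would be updated by gradient descent.

```python
import numpy as np

def weighted_layer_sum(layer_feats, layer_logits):
    """Fuse per-layer features using softmax-normalized scalar weights.

    layer_feats:  array of shape (num_layers, num_frames, dim)
    layer_logits: array of shape (num_layers,), the learnable parameters
    (softmax normalization is an assumption, not stated in the source)
    """
    shifted = layer_logits - np.max(layer_logits)  # for numerical stability
    weights = np.exp(shifted) / np.sum(np.exp(shifted))
    # Contract the layer axis; the result has shape (num_frames, dim).
    return np.tensordot(weights, layer_feats, axes=(0, 0))

# Random stand-ins for encoder hidden states (illustrative sizes only).
rng = np.random.default_rng(0)
feats = rng.standard_normal((13, 50, 768))
logits = np.zeros(13)  # zero logits give uniform weights

fused = weighted_layer_sum(feats, logits)
print(fused.shape)  # (50, 768)
```

With all logits at zero the fusion reduces to a plain average over layers; learning the logits lets the model emphasize whichever layers are most useful for the codec.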
Experiments demonstrate that L3-SE substantially reduces linguistic hallucination and improves content faithfulness compared to prior LM-based baselines. Gains are especially clear under low-SNR and reverberant conditions, while perceptual quality remains competitive. The framework consistently outperforms those baselines on linguistic consistency metrics, making speech enhancement more reliable in adverse acoustic environments. Audio samples are available online, and code will be released upon acceptance.
- L3-SE uses noise-invariant acoustic-semantic distillation from two clean-speech targets (acoustic + semantic).
- It conditions a decoder-only autoregressive LM on these representations and uses a high-fidelity WavLM-based codec as the discrete acoustic interface for token prediction.
- It consistently improves linguistic consistency, particularly under low-SNR and reverberant conditions.
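The two-target distillation described in the key points can be sketched as a combined loss. This is an illustration under stated assumptions, not the paper's objective: the summary does not specify the loss functions, so the mean squared error for the acoustic term, the cosine distance for the semantic term, the function name `distillation_loss`, and the `semantic_weight` parameter are all hypothetical choices.

```python
import numpy as np

def distillation_loss(student_acoustic, student_semantic,
                      target_acoustic, target_semantic,
                      semantic_weight=1.0):
    """Combine an acoustic and a semantic distillation term.

    Student features come from noisy speech; targets come from the
    matching clean speech. All arrays have shape (num_frames, dim).
    """
    # Acoustic term: mean squared error (assumed loss).
    acoustic = np.mean((student_acoustic - target_acoustic) ** 2)

    # Semantic term: 1 - cosine similarity per frame, averaged (assumed loss).
    dot = np.sum(student_semantic * target_semantic, axis=-1)
    norms = (np.linalg.norm(student_semantic, axis=-1)
             * np.linalg.norm(target_semantic, axis=-1) + 1e-8)
    semantic = np.mean(1.0 - dot / norms)

    return acoustic + semantic_weight * semantic

# Random stand-ins for clean-speech targets and noisy-speech student outputs.
rng = np.random.default_rng(0)
clean_ac = rng.standard_normal((50, 64))
clean_sem = rng.standard_normal((50, 64))
noisy_ac = clean_ac + 0.5 * rng.standard_normal((50, 64))
noisy_sem = clean_sem + 0.5 * rng.standard_normal((50, 64))

loss_matched = distillation_loss(clean_ac, clean_sem, clean_ac, clean_sem)
loss_noisy = distillation_loss(noisy_ac, noisy_sem, clean_ac, clean_sem)
print(loss_matched < loss_noisy)
```

Driving both terms toward zero pulls the noisy-speech representations toward their clean-speech counterparts, which is the sense in which the learned representations become noise-invariant.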
Why It Matters
L3-SE makes speech enhancement more reliable for transcription and voice assistants in noisy environments.