Audio & Speech

LESS: Large Language Model Enhanced Semi-Supervised Learning for Speech Foundational Models Using in-the-wild Data

New method cuts Word Error Rate by an absolute 3.8% and lifts translation BLEU scores by refining noisy speech labels.

Deep Dive

A team of researchers has developed a new framework called LESS (Large Language Model Enhanced Semi-supervised Learning) to tackle a major bottleneck in training speech AI: the poor quality of labels generated from real-world, noisy audio. Current state-of-the-art Speech Foundational Models struggle with the complex acoustics of uncurated data, producing unreliable text transcripts (pseudo-labels) that hinder effective semi-supervised learning. LESS innovates by using a Large Language Model as a post-processor to correct and refine these noisy pseudo-labels, followed by a data filtering strategy to select the highest-quality samples for training. This creates a cleaner, more effective training dataset from vast amounts of otherwise unusable audio.
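The two-stage idea, an LLM post-processor that rewrites noisy pseudo-labels followed by a filter that keeps only high-quality samples, can be sketched as below. This is a minimal illustration, not the authors' released recipe: the function names, the stand-in LLM, and the use of hypothesis-vs-refinement word error rate as the filtering criterion are all assumptions for demonstration.

```python
# Illustrative sketch of a LESS-style pipeline (hypothetical names and
# filtering criterion, not the paper's actual implementation):
# 1) an LLM rewrites each noisy ASR pseudo-label,
# 2) samples whose refined label diverges too far from the original
#    hypothesis are dropped as unreliable.

def edit_distance(a, b):
    """Word-level Levenshtein distance between two transcripts."""
    aw, bw = a.split(), b.split()
    dp = list(range(len(bw) + 1))
    for i, wa in enumerate(aw, 1):
        prev, dp[0] = dp[0], i
        for j, wb in enumerate(bw, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,       # deletion
                                     dp[j - 1] + 1,   # insertion
                                     prev + (wa != wb))  # substitution
    return dp[-1]

def refine_with_llm(pseudo_label, llm):
    """Stand-in for the LLM post-processing step: the LLM maps a noisy
    ASR hypothesis to a corrected transcript."""
    return llm(pseudo_label)

def filter_samples(samples, llm, max_wer=0.3):
    """Keep (audio, refined_label) pairs whose refined label stays close
    to the original hypothesis -- a simple proxy for label quality."""
    kept = []
    for audio, hyp in samples:
        ref = refine_with_llm(hyp, llm)
        wer = edit_distance(hyp, ref) / max(len(ref.split()), 1)
        if wer <= max_wer:
            kept.append((audio, ref))
    return kept
```

A toy run: if the LLM only fixes a spelling slip in one hypothesis but rewrites another beyond recognition, the filter keeps the first (small edit distance) and discards the second, yielding a smaller but cleaner training set.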

The method proved effective across different languages and tasks. In evaluations, LESS achieved an absolute Word Error Rate (WER) reduction of 3.8% on the Mandarin WenetSpeech dataset for Automatic Speech Recognition (ASR). For Spanish-to-English Automatic Speech Translation (AST), it boosted BLEU scores by 0.8 and 0.7 points, reaching 34.0 on the Callhome and 64.7 on the Fisher test sets. These consistent gains demonstrate the framework's versatility. By open-sourcing the recipe for the work, which has been accepted for presentation at ICASSP 2026, the researchers aim to accelerate progress toward more robust and accurate speech models that perform reliably outside controlled laboratory environments.

Key Points
  • Uses LLMs to correct noisy text labels from real-world speech data, enabling better semi-supervised training.
  • Achieved a 3.8% absolute WER reduction on Mandarin ASR and BLEU gains of 0.8 and 0.7 points on Spanish-to-English translation.
  • The open-source framework works across diverse languages and tasks, making 'in-the-wild' audio a viable training resource.

Why It Matters

It unlocks the potential of vast, unlabeled real-world audio to build more accurate and robust speech recognition and translation systems.