Audio & Speech

ReHear: Iterative Pseudo-Label Refinement for Semi-Supervised Speech Recognition via Audio Large Language Models

New framework integrates an audio-aware LLM into semi-supervised ASR training to refine noisy pseudo-labels, outperforming both supervised and standard pseudo-labeling baselines.

Deep Dive

A new research paper introduces ReHear, a novel framework designed to solve a core problem in semi-supervised automatic speech recognition (ASR): the error accumulation caused by noisy pseudo-labels. Traditional self-training retrains an ASR model on its own imperfect transcripts, producing a confirmation bias in which recognition errors are reinforced from one round to the next. ReHear breaks this cycle by integrating an instruction-tuned, audio-aware large language model (LLM) directly into the training loop.
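To make that failure mode concrete, here is a minimal sketch of conventional pseudo-label self-training. The `transcribe` and `fine_tune` callables are hypothetical placeholders for a real ASR stack, not part of the paper; the point is simply that the model's own uncorrected outputs become its next training targets.

```python
def naive_self_training(asr_model, labeled_data, unlabeled_audio,
                        transcribe, fine_tune, rounds=3):
    """Plain pseudo-label self-training (the baseline ReHear improves on).

    `transcribe(model, audio)` returns a text hypothesis and
    `fine_tune(model, pairs)` returns an updated model; both are
    hypothetical stand-ins for a real training pipeline.
    """
    # Bootstrap from the small labeled set.
    asr_model = fine_tune(asr_model, labeled_data)

    for _ in range(rounds):
        # The model's own hypotheses become training targets unchanged,
        # so systematic recognition errors feed straight back into training.
        pseudo_labels = [(audio, transcribe(asr_model, audio))
                         for audio in unlabeled_audio]
        asr_model = fine_tune(asr_model, labeled_data + pseudo_labels)

    return asr_model
```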

Technically, ReHear's innovation lies in its two-stage iterative process. First, an initial ASR model generates hypotheses (pseudo-labels) from unlabeled audio. Crucially, instead of using these hypotheses directly as training targets, ReHear feeds both the hypothesis and the source audio to a specialized audio LLM. Conditioned on the actual acoustic signal, this model can correct severe recognition errors and recover phonetically accurate transcripts, yielding high-fidelity refined labels. These refined labels are then used to fine-tune the ASR model, and the cycle repeats. The paper reports that this approach effectively mitigates error propagation and outperforms both standard supervised baselines and conventional pseudo-labeling methods across multiple benchmarks.
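The sketch below shows how this loop could be wired together. It is an illustration under stated assumptions, not the paper's implementation: `transcribe`, `refine_with_audio_llm`, and `fine_tune` are hypothetical callables, and the only change from the plain self-training loop sketched above is that every hypothesis passes through an audio-conditioned LLM corrector before it is used as a training target.

```python
def rehear_style_training(asr_model, labeled_data, unlabeled_audio,
                          transcribe, refine_with_audio_llm, fine_tune,
                          rounds=3):
    """Iterative pseudo-label refinement in the style described above.

    All callables are hypothetical placeholders: `transcribe(model, audio)`
    returns a text hypothesis, `refine_with_audio_llm(audio, hypothesis)`
    returns a corrected transcript from an audio-aware LLM, and
    `fine_tune(model, pairs)` returns an updated ASR model.
    """
    # Bootstrap from the available labeled data.
    asr_model = fine_tune(asr_model, labeled_data)

    for _ in range(rounds):
        refined_labels = []
        for audio in unlabeled_audio:
            # Stage 1: the current ASR model proposes a pseudo-label.
            hypothesis = transcribe(asr_model, audio)
            # Stage 2: the audio LLM sees both the waveform and the text
            # hypothesis and returns a corrected, higher-fidelity transcript.
            refined = refine_with_audio_llm(audio, hypothesis)
            refined_labels.append((audio, refined))

        # Fine-tune on the refined labels (not the raw hypotheses);
        # the improved model then feeds the next refinement round.
        asr_model = fine_tune(asr_model, labeled_data + refined_labels)

    return asr_model
```

The design choice that matters here is that the corrector is conditioned on the audio itself: a text-only LLM can only smooth a bad hypothesis, whereas an audio-aware model can, per the paper's claim, recover phonetically accurate content the ASR model missed.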

The context for this work is the high cost and scarcity of accurately transcribed speech data needed to train robust ASR systems. Semi-supervised learning, which leverages vast amounts of unlabeled audio, is essential but has been hampered by the quality of automatically generated training targets. ReHear's use of emerging audio LLMs as intelligent correctors represents a significant architectural shift. The practical implication is the potential to build more accurate speech recognition systems for diverse accents, noisy environments, and specialized vocabularies using primarily unlabeled data, reducing reliance on expensive human transcription.

Key Points
  • Integrates an audio-aware LLM to correct ASR pseudo-labels using both text hypothesis and source audio, recovering from severe errors.
  • Implements an iterative self-training loop where refined labels are used to fine-tune the ASR model, breaking the cycle of error propagation.
  • Demonstrated performance gains, consistently outperforming both supervised training and standard pseudo-labeling baselines across diverse benchmarks.

Why It Matters

Enables creation of more accurate speech recognition models using primarily unlabeled data, reducing dependency on costly human transcription.