Audio & Speech

Revisiting ASR Error Correction with Specialized Models

A compact seq2seq model beats large language models at correcting speech recognition errors, achieving 1.5% WER on LibriSpeech test-clean.

Deep Dive

A research team from Meta and Apple has published a paper titled 'Revisiting ASR Error Correction with Specialized Models,' challenging the current trend of using large language models (LLMs) to fix speech recognition errors. The authors argue that while LLMs are powerful, they introduce significant latency and hallucination problems when applied to ASR correction. Instead, they propose returning to compact, task-specific sequence-to-sequence (seq2seq) models. Their key innovation is a scalable training method that constructs massive synthetic corpora by cascading Text-to-Speech (TTS) and ASR systems, generating the realistic error patterns they found crucial for performance.
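The cascade idea can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `toy_tts` and `toy_asr` are hypothetical stand-ins for real TTS and ASR systems, and the confusion table merely mimics the kind of acoustically plausible errors a real cascade would produce.

```python
# Sketch of the TTS -> ASR cascade for building synthetic correction data.
# toy_tts / toy_asr are hypothetical stand-ins for real systems.

def synthesize_pairs(references, tts, asr):
    """Run each clean reference through TTS then ASR; the ASR output
    becomes a noisy hypothesis paired with the clean reference."""
    pairs = []
    for ref in references:
        audio = tts(ref)   # a real system would synthesize a waveform
        hyp = asr(audio)   # a real system would transcribe it, with errors
        pairs.append((hyp, ref))
    return pairs

def toy_tts(text):
    # Stand-in: passes the text through unchanged as the "audio".
    return text

# A fixed confusion table mimics acoustically plausible ASR errors.
CONFUSIONS = {"recognize": "wreck a nice", "speech": "beach"}

def toy_asr(audio):
    return " ".join(CONFUSIONS.get(w, w) for w in audio.split())

pairs = synthesize_pairs(["recognize speech", "hello world"], toy_tts, toy_asr)
# pairs[0] -> ("wreck a nice beach", "recognize speech")
```

The resulting (hypothesis, reference) pairs are exactly the supervision a seq2seq correction model needs, and the cascade can generate them at whatever scale the TTS/ASR systems allow.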

Their proposed system uses a 'correction-first decoding' pipeline. First, the compact correction model generates candidate corrections. These candidates are then rescored using the original ASR system's acoustic confidence scores. The results are striking: with 15x fewer parameters than typical LLMs, their model achieves a 1.5% Word Error Rate (WER) on the LibriSpeech test-clean benchmark and 3.3% on the more challenging test-other set, outperforming LLM-based approaches.

Crucially, the model excels in the low-error regime—where an ASR transcript is already mostly correct—a scenario where LLMs are prone to introducing new errors through hallucinations. The research demonstrates strong generalization, showing the correction model works effectively across three distinct ASR backbones: Connectionist Temporal Classification (CTC), seq2seq, and Transducer models, as well as across diverse audio domains.

Key Points
  • Achieves 1.5% WER on LibriSpeech test-clean with a model 15x smaller than LLMs, using a novel 'correction-first decoding' method.
  • Solves the LLM hallucination problem in low-error scenarios by specializing on realistic ASR error patterns from synthetic data.
  • Generalizes across three core ASR architectures (CTC, seq2seq, Transducer) and diverse audio domains without retraining.

Why It Matters

Enables faster, more accurate, and reliable speech-to-text for applications like meeting transcription, voice assistants, and captioning without LLM costs.