Self-Speculative Decoding for LLM-based ASR with CTC Encoder Drafts
A new method from IBM Research cuts LLM-based speech recognition decoding time by roughly 77% while setting an accuracy record.
A team from IBM Research, including George Saon and Samuel Thomas, has introduced a novel technique called 'Self-Speculative Decoding' that dramatically accelerates Large Language Model (LLM)-based Automatic Speech Recognition (ASR). The core innovation is using a smaller, faster Connectionist Temporal Classification (CTC) encoder as a 'draft' model to predict likely transcriptions. This draft is then efficiently verified or corrected by the main LLM in a single forward pass, bypassing the slower, token-by-token autoregressive decoding typically required. This hybrid approach allows the system to skip extensive computation for easy-to-predict audio segments while maintaining high accuracy for complex ones.
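To make the mechanics concrete, here is a minimal sketch of the draft-and-verify loop, assuming greedy CTC collapse for drafting and greedy acceptance during verification; the function names, blank-token id, and dummy model are illustrative assumptions, not IBM's released code.

```python
import torch

BLANK = 0  # hypothetical CTC blank id

def ctc_greedy_draft(ctc_logits: torch.Tensor) -> list[int]:
    """Standard greedy CTC decoding: take the per-frame argmax over the
    (T, V) encoder logits, merge repeated tokens, and drop blanks."""
    path = ctc_logits.argmax(dim=-1).tolist()
    draft, prev = [], None
    for tok in path:
        if tok != prev and tok != BLANK:
            draft.append(tok)
        prev = tok
    return draft

@torch.no_grad()
def verify_draft(llm, prefix: list[int], draft: list[int]):
    """Score the entire draft with ONE forward pass of the LLM and keep
    the longest prefix the LLM itself would have generated greedily.

    llm maps (1, L) token ids to (1, L, V) next-token logits, so the
    logits at position j predict token j + 1; `prefix` must be non-empty
    (e.g. a BOS token) so the first draft position can be scored.
    Returns (accepted tokens, index of the first rejected position).
    """
    tokens = torch.tensor(prefix + draft).unsqueeze(0)
    logits = llm(tokens)[0]  # (L, V)
    accepted = []
    for i, tok in enumerate(draft):
        pred = logits[len(prefix) + i - 1].argmax().item()
        if pred != tok:
            return accepted, i  # resume autoregressive decoding here
        accepted.append(tok)
    return accepted, len(draft)

if __name__ == "__main__":
    torch.manual_seed(0)
    V = 32
    # Stand-in for the LLM: any callable with the shape contract above.
    emb = torch.nn.Embedding(V, V)
    llm = lambda toks: emb(toks)
    draft = ctc_greedy_draft(torch.randn(50, V))
    accepted, k = verify_draft(llm, prefix=[1], draft=draft)
    print(f"draft of {len(draft)} tokens; first {k} accepted")
```

When a position is rejected, the method falls back to ordinary autoregressive decoding from that point and then re-drafts, so easy segments are accepted in bulk while hard ones still receive the LLM's full token-by-token treatment.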
Tested across nine corpora in five languages, the method set a new benchmark. Using a 1B-parameter LLM with a 440M-parameter CTC encoder, it achieved a state-of-the-art 5.58% Word Error Rate (WER) on the HuggingFace Open ASR Leaderboard. Crucially, it improved the inverse real-time factor, a measure of decoding throughput, by a factor of 4.4: the system transcribes the same audio in roughly a quarter of the time (about 77% less), at the cost of a modest 12% relative increase in WER compared to a full autoregressive search. The code and model weights have been released under a permissive license, enabling immediate industry and research application.
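As a reference for these speed figures, the inverse real-time factor (RTFx) is simply audio duration divided by wall-clock decoding time, so a 4.4x RTFx gain is the same statement as a 77% time reduction. The numbers below are illustrative, not taken from the paper.

```python
def rtfx(audio_seconds: float, decode_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio transcribed per second
    of compute. 1.0 is exactly real time; higher is faster."""
    return audio_seconds / decode_seconds

# Hypothetical baseline: 60 s of audio decoded in 12 s -> RTFx of 5.0.
baseline = rtfx(60.0, 12.0)
# A 4.4x RTFx improvement decodes the same audio in 12 / 4.4 seconds.
speculative = rtfx(60.0, 12.0 / 4.4)
print(baseline, speculative)             # 5.0 22.0 (= 5.0 * 4.4)
print(f"time saved: {1 - 1 / 4.4:.1%}")  # 77.3%
```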
- Achieved a record 5.58% Word Error Rate (WER) on the HuggingFace Open ASR Leaderboard using a 1B-parameter LLM.
- Sped up speech decoding by a factor of 4.4 (roughly 77% less decoding time) using a 440M-parameter CTC encoder as the draft model.
- Released publicly under a permissive license, allowing for immediate integration into commercial and research speech systems.
Why It Matters
This breakthrough makes real-time, highly accurate speech-to-text for applications like live captioning and voice assistants significantly more efficient and cost-effective.