NLE: Non-autoregressive LLM-based ASR by Transcript Editing
New non-autoregressive model reaches an inverse real-time factor (RTFx) of 1630 while keeping word error rate competitive.
IBM researchers have introduced NLE (Non-autoregressive LLM-based ASR by Transcript Editing), an approach to automatic speech recognition that replaces sequential decoding with parallel editing. Current autoregressive LLM-based ASR systems are accurate, but they generate one token at a time, so latency grows with transcript length. NLE reframes the task as conditional transcript editing, which lets the model predict every output position in parallel. The system first uses a pretrained speech encoder to extract acoustic embeddings and produce an initial hypothesis, then refines that draft with a bidirectional LLM editor trained with a latent alignment objective.
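To make the latency argument concrete, here is a minimal sketch that counts model calls under the two decoding styles. Everything in it is a toy stand-in rather than NLE's actual code: `toy_next_token`, `toy_edit_all`, and the `FIXES` lookup are hypothetical helpers that play the role of a real decoder and editor.

```python
from typing import Callable, List, Tuple

EOS = "<eos>"

def autoregressive_decode(
    next_token: Callable[[List[str]], str], max_len: int
) -> Tuple[List[str], int]:
    """Generate one token per model call; each call waits on the previous one."""
    out: List[str] = []
    calls = 0
    for _ in range(max_len):
        tok = next_token(out)
        calls += 1
        if tok == EOS:
            break
        out.append(tok)
    return out, calls

def parallel_edit(
    edit_all: Callable[[List[str]], List[str]], draft: List[str]
) -> Tuple[List[str], int]:
    """Refine every position of an existing draft in a single model call."""
    return edit_all(draft), 1

# Toy stand-ins for the models (assumptions for illustration only).
REFERENCE = ["the", "cat", "sat"]

def toy_next_token(prefix: List[str]) -> str:
    # Pretend decoder: emits the reference one token at a time, then EOS.
    return REFERENCE[len(prefix)] if len(prefix) < len(REFERENCE) else EOS

FIXES = {"cap": "cat"}

def toy_edit_all(draft: List[str]) -> List[str]:
    # Pretend editor: corrects all positions at once via a lookup table.
    return [FIXES.get(tok, tok) for tok in draft]

ar_out, ar_calls = autoregressive_decode(toy_next_token, max_len=10)
ed_out, ed_calls = parallel_edit(toy_edit_all, ["the", "cap", "sat"])
print(ar_out, ar_calls)
print(ed_out, ed_calls)
```

The autoregressive loop needs one call per output token plus one for the end marker, while the editing path needs a single call regardless of transcript length. Real systems differ in per-call cost, so the call count is only a rough proxy for latency.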
A key component is the 'interleaved padding strategy,' which exploits the identity mapping bias of Transformer architectures. Since tokens that are already correct can simply be copied through, the model can concentrate on correcting errors rather than reconstructing the entire transcript from scratch. On the Open ASR leaderboard, the enhanced NLE++ variant achieves a 5.67% average Word Error Rate (WER) with an inverse real-time factor (RTFx, audio duration divided by processing time) of 1630.
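The summary does not spell out how the interleaved padding is laid out, so the following is only one plausible reading, sketched under stated assumptions: padding slots are placed around each draft token, the editor writes one symbol per slot, and blank symbols are dropped afterwards. The `PAD` and `BLANK` symbols, the one-pad-per-gap layout, and the hand-written `edited` sequence (standing in for the editor's output) are all hypothetical.

```python
from typing import List

PAD = "<pad>"      # hypothetical empty slot that can absorb an insertion
BLANK = "<blank>"  # hypothetical "emit nothing" symbol

def interleave_with_padding(hyp: List[str], pads_per_gap: int = 1) -> List[str]:
    """Lay out the draft as PAD t1 PAD t2 ... tn PAD."""
    slots = [PAD] * pads_per_gap
    for tok in hyp:
        slots.append(tok)
        slots.extend([PAD] * pads_per_gap)
    return slots

def collapse(edited: List[str]) -> List[str]:
    """Drop blanks and unused pads to recover the final transcript."""
    return [tok for tok in edited if tok not in (PAD, BLANK)]

def count_real_edits(slots: List[str], edited: List[str]) -> int:
    """Count positions that are not a plain copy (PAD -> BLANK counts as a copy)."""
    return sum(
        1
        for s, e in zip(slots, edited)
        if not (e == s or (s == PAD and e == BLANK))
    )

draft = ["a", "cat", "sat", "the", "mat"]
slots = interleave_with_padding(draft)

# Hand-written stand-in for the editor's per-slot predictions:
# substitute "a" -> "the", insert "on" into the pad after "sat", copy the rest.
edited = [BLANK, "the", BLANK, "cat", BLANK, "sat", "on", "the", BLANK, "mat", BLANK]

final = collapse(edited)
print(" ".join(final))
print(count_real_edits(slots, edited), "of", len(slots), "slots changed")
```

In this toy layout most slots are identity copies, which is the situation an identity-biased model handles cheaply; only the substitution and the insertion require a real decision.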
For single-utterance transcription, NLE is 27x faster than a comparable autoregressive baseline. This combination of accuracy and parallel decoding suits latency-sensitive, real-time applications such as live captioning, voice assistants, and meeting transcription, where immediate feedback is critical.
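The reported figures can be turned into wall-clock terms with a little arithmetic. RTFx is audio duration divided by processing time, so an RTFx of 1630 implies roughly 2.2 seconds of compute per hour of audio. The sketch below works through that, plus a purely hypothetical latency example for the 27x speedup; the 270 ms baseline is an invented number, not a measurement from the paper.

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio handled per second of compute."""
    return audio_seconds / processing_seconds

def processing_time(audio_seconds: float, rtfx_value: float) -> float:
    """Compute time needed for a given amount of audio at a given RTFx."""
    return audio_seconds / rtfx_value

def sped_up_latency(baseline_seconds: float, speedup: float) -> float:
    """Latency after applying a multiplicative speedup to a baseline."""
    return baseline_seconds / speedup

# One hour of audio at the reported RTFx of 1630.
one_hour = processing_time(3600.0, 1630.0)
print(f"1 h of audio at RTFx 1630: {one_hour:.2f} s of processing")

# Hypothetical baseline: an autoregressive model taking 270 ms per utterance.
hypothetical_baseline = 0.270
print(f"27x faster than 270 ms: {sped_up_latency(hypothetical_baseline, 27.0) * 1000:.0f} ms")
```

Note that RTFx is a throughput measure and the 27x figure is a single-utterance latency comparison, so the two numbers describe different things and should not be multiplied together.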
- Achieves 27x speedup over autoregressive baselines for single-utterance transcription
- Uses novel 'interleaved padding strategy' to exploit Transformer identity mapping bias
- Scores 5.67% average WER with RTFx of 1630 on Open ASR leaderboard
Why It Matters
Makes accurate, low-latency speech recognition more practical for live captioning, voice assistants, and meeting transcription.