MBR decoding beats beam search for Whisper speech recognition
New research shows sample-based MBR decoding outperforms beam search in ASR tasks...
A new paper by Yuu Jinnai revisits Minimum Bayes Risk (MBR) decoding for automatic speech recognition (ASR) and speech translation (ST). Beam search has long been the standard decoding method for speech-to-text tasks, but recent work has shown that sample-based MBR decoding outperforms it in text-to-text generation such as machine translation. Jinnai tests whether this advantage extends to speech, using OpenAI's Whisper models and their derivatives on English and Japanese datasets.
The results show that MBR decoding achieves higher accuracy than beam search in most experimental settings. The method is especially promising for offline ASR and ST tasks where high accuracy is critical. MBR generates multiple candidate hypotheses and selects the one with the lowest expected risk, that is, the highest expected utility under a chosen utility function. This reduces errors compared to the greedy or beam-search approaches traditionally used in speech. The code is open-sourced, making it easy for practitioners to adopt.
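The selection step described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function names (`word_edit_distance`, `neg_wer`, `mbr_select`) are hypothetical, the utility is assumed to be negative word error rate, and the candidate list is hard-coded where a real system would draw samples from a Whisper model. Each candidate is scored by its average utility against all sampled candidates, and the highest-scoring one is returned.

```python
from typing import Callable, Sequence


def word_edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two word sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def neg_wer(hyp: str, ref: str) -> float:
    """Assumed utility: negative word error rate of hyp against ref."""
    ref_words, hyp_words = ref.split(), hyp.split()
    return -word_edit_distance(ref_words, hyp_words) / max(len(ref_words), 1)


def mbr_select(samples: Sequence[str],
               utility: Callable[[str, str], float] = neg_wer) -> str:
    """Return the sample with the highest average utility against all samples.

    The samples serve both as candidates and as pseudo-references, which is
    the usual Monte Carlo estimate of expected utility in sample-based MBR.
    """
    if not samples:
        raise ValueError("need at least one sample")
    best, best_score = samples[0], float("-inf")
    for hyp in samples:
        score = sum(utility(hyp, ref) for ref in samples) / len(samples)
        if score > best_score:
            best, best_score = hyp, score
    return best


# Hard-coded stand-ins for transcripts sampled from a speech model.
samples = [
    "the cat sat on the mat",
    "the cat sat on the mat",
    "the cat sat on a mat",
    "a cat sat on the hat",
]
best = mbr_select(samples)
print(best)  # the cat sat on the mat
```

Scoring compares every candidate with every other, so the cost grows quadratically with the number of samples, which fits the article's point that the method suits offline use.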
- MBR decoding outperforms beam search for Whisper-based ASR on English and Japanese
- Tested on both ASR and speech translation tasks across multiple Whisper model variants
- The sample-based approach selects the hypothesis with minimum expected risk, boosting accuracy
Why It Matters
This could improve accuracy in production speech systems without retraining the model: only the decoding strategy changes.