I benchmarked 31 STT models on medical audio — VibeVoice 9B is the new open-source leader at 8.34% WER, but it's big and slow
A new benchmark of 31 speech models reveals Whisper's text normalizer had bugs inflating error rates by 2-3%.
An extensive third-party benchmark of 31 speech-to-text models on medical audio has crowned Microsoft's VibeVoice-ASR 9B as the new open-source accuracy champion. The model achieved a Word Error Rate of 8.34% on the PriMock57 dataset of British English doctor-patient consultations, coming remarkably close to the leading API model, Google's Gemini 2.5 Pro, which scored 8.15%. However, this performance comes with significant computational costs: VibeVoice is a 9-billion-parameter model requiring approximately 18GB of VRAM, and even on an H100 GPU it takes a sluggish 97 seconds per file. For comparison, NVIDIA's Parakeet TDT 0.6B v3 completed the same task in just 6 seconds on Apple Silicon, albeit with a higher 9.35% WER.
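For readers unfamiliar with the metric behind these rankings, Word Error Rate is the word-level edit distance between a model's transcript and the reference, divided by the number of reference words. The sketch below is a minimal illustrative implementation; real benchmarks typically rely on established libraries such as jiwer rather than hand-rolled code.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = d[i - 1][j] + 1
            insertion = d[i][j - 1] + 1
            d[i][j] = min(substitution, deletion, insertion)
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution out of five reference words -> 20% WER.
print(wer("the patient reported chest pain",
          "the patient reports chest pain"))  # 0.2
```

A single substituted word in a five-word reference yields 0.2, i.e. 20% WER; the leaderboard figures above are this ratio averaged over the whole dataset.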
The benchmark's most consequential finding wasn't about a single model, but a systemic flaw in a widely used evaluation tool. The researcher discovered two critical bugs in Whisper's EnglishTextNormalizer that had been artificially inflating WER scores across every model tested by approximately 2-3%. The first bug incorrectly treated the interjection "oh" as the digit zero, creating thousands of false errors in medical dialogue. The second bug failed to normalize common word variants like "ok/okay" or "yeah/yes," counting them as mistakes. All results in the v3 benchmark have been recalculated using a new, open-source custom normalizer, providing a more accurate baseline for the field. The full leaderboard, which includes models from ElevenLabs, NVIDIA, Mistral, and others, is publicly available on GitHub.
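To see why these normalizer bugs matter, consider what happens when normalization runs before WER scoring. The sketch below is a hypothetical reconstruction of the kind of fix described above, not the benchmark's actual normalizer: it leaves the interjection "oh" alone rather than rewriting it as the digit 0, and collapses spoken variants like "ok"/"okay" and "yeah"/"yes" to one canonical form so they no longer count as errors. The `VARIANTS` table and `normalize` function are illustrative names.

```python
import re

# Hypothetical variant table: map spoken alternatives to one canonical form
# so "ok" vs "okay" is not scored as a substitution error.
VARIANTS = {"ok": "okay", "yeah": "yes"}

def normalize(text: str) -> str:
    # Lowercase and keep only word/number tokens.
    words = re.findall(r"[a-z']+|\d+", text.lower())
    # Note: unlike the buggy normalizer described above, "oh" is left
    # as-is instead of being treated as the digit zero.
    return " ".join(VARIANTS.get(w, w) for w in words)

print(normalize("Oh, OK, yeah, that hurts"))  # oh okay yes that hurts
```

With a mapping like this applied to both reference and hypothesis before scoring, filler-word mismatches disappear from the error count, which is consistent with the reported 2-3% drop in WER after the recalculation.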
- Microsoft VibeVoice-ASR 9B leads open-source models with 8.34% WER, nearly matching Gemini 2.5 Pro's 8.15%.
- Critical bugs in Whisper's text normalizer were found to inflate WER scores by 2-3% across all 31 models tested.
- The 9B-parameter VibeVoice model requires ~18GB VRAM and runs slowly at 97s/file, highlighting a trade-off between accuracy and efficiency.
Why It Matters
This benchmark provides crucial, corrected performance data for developers choosing STT models for healthcare applications, where accuracy is paramount.