Audio & Speech

Quality of Automatic Speech Recognition -- Polish Language case study -- from Wav2Vec to Scribe ElevenLabs

A new academic benchmark shows ElevenLabs' Scribe model outperforms OpenAI Whisper and other open-source models for Polish medical transcription.

Deep Dive

A team of researchers from AGH University of Science and Technology in Poland has released a comprehensive benchmark comparing modern Automatic Speech Recognition (ASR) systems for the Polish language. The study, titled 'Quality of Automatic Speech Recognition -- Polish Language case study -- from Wav2Vec to Scribe ElevenLabs,' specifically evaluated performance on medical interview data and general Polish benchmarks. It tested seven models, including a hybrid pipeline of OpenAI's Whisper model paired with a Large Language Model (LLM) for correction, and ElevenLabs' proprietary Scribe model. The other contenders were open-source, end-to-end architectures like QuartzNet, FastConformer, Wav2Vec 2.0 XLSR, and models from the ESPnet Model Zoo.

The evaluation measured accuracy using Word Error Rate (WER) and Character Error Rate (CER) across clean, bandwidth-limited, and degraded audio signals. The results were decisive: ElevenLabs' Scribe model achieved the best performance for Polish on both the general benchmarks and the specialized medical data. OpenAI's Whisper model, when enhanced with an LLM for post-processing, emerged as the strongest performer among the open-source options tested. This study fills a significant gap by providing rigorous, independent validation of commercial and open-source ASR tools for a major European language, offering developers concrete data to choose the right engine for Polish voice applications, particularly in high-stakes fields like healthcare documentation.

Key Points
  • ElevenLabs' Scribe model outperformed all others, including a Whisper+LLM combo, for Polish speech recognition accuracy (WER/CER).
  • The study tested 7 models on medical interviews and public datasets like Mozilla Common Voice, under various audio conditions.
  • OpenAI's Whisper was the top-performing open-source model, especially when its output was corrected by a separate Large Language Model.

Why It Matters

Provides essential benchmark data for developers building voice AI in Polish, crucial for healthcare, customer service, and accessibility tools.