Research & Papers

Benchmark reveals ElevenLabs Scribe v2 dominates code-switching ASR at 13.2% WER

Five commercial ASR systems tested on Arabic–English, Persian–English, and German–English code-switching speech.

Deep Dive

A new paper by Sajjad Abdoli, Ghassan Al-Sumaidaee, and colleagues presents the first comprehensive benchmark of commercial automatic speech recognition (ASR) systems on code-switching speech, in which speakers alternate between two languages within a single utterance. The study evaluates five major ASR providers across four language pairs: Egyptian Arabic–English, Saudi Arabic (Najdi/Hijazi)–English, Persian (Farsi)–English, and German–English. Each dataset contains 300 samples selected by a two-stage pipeline: a heuristic filter first scores transcripts on five structural code-switching signals, then an ensemble of GPT-4o and Gemini 1.5 Pro scores the surviving candidates across six linguistic dimensions. Because only filtered candidates reach the LLM stage, the pipeline cut LLM scoring costs by approximately 91% compared with exhaustive scoring, which makes large-scale code-switching benchmarks considerably cheaper to build.
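The two-stage idea can be sketched as follows. This is a minimal illustration, not the authors' implementation: `heuristic_score` uses a single made-up signal (script switches between adjacent tokens) in place of the paper's five structural signals, the LLM scorers are plain callables standing in for GPT-4o and Gemini 1.5 Pro, and `keep_fraction` and `n_final` are hypothetical parameters. A script-switch signal would also not help for German–English, where both languages use Latin script.

```python
import re
from typing import Callable, List, Tuple

LATIN = re.compile(r"[A-Za-z]")


def is_latin(token: str) -> bool:
    """True if the token contains at least one basic Latin letter."""
    return bool(LATIN.search(token))


def heuristic_score(transcript: str) -> float:
    """Hypothetical stage-1 signal: share of adjacent token pairs that switch script."""
    tokens = transcript.split()
    if len(tokens) < 2:
        return 0.0
    switches = sum(is_latin(a) != is_latin(b) for a, b in zip(tokens, tokens[1:]))
    return switches / (len(tokens) - 1)


def two_stage_select(
    transcripts: List[str],
    llm_scorers: List[Callable[[str], float]],  # stand-ins for real LLM calls
    keep_fraction: float,
    n_final: int,
) -> Tuple[List[str], int]:
    """Return the selected transcripts and the number of LLM calls made."""
    # Stage 1: cheap heuristic ranking; only the top slice moves on.
    ranked = sorted(transcripts, key=heuristic_score, reverse=True)
    n_keep = max(1, round(len(ranked) * keep_fraction))
    shortlist = ranked[:n_keep]

    # Stage 2: average the scores of every scorer in the ensemble.
    llm_calls = 0
    scored = []
    for text in shortlist:
        scores = [scorer(text) for scorer in llm_scorers]
        llm_calls += len(scores)
        scored.append((sum(scores) / len(scores), text))

    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:n_final]], llm_calls
```

If cost scaled purely with the number of LLM calls, passing 9% of a candidate pool to stage 2 would give a 91% saving over scoring everything. That mapping is an assumption made here for illustration; the summary does not say how the reported figure was derived.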

The results are clear: ElevenLabs Scribe v2 led on all four language pairs, with a 13.2% overall Word Error Rate (WER), including 13.1% on Egyptian Arabic, and the highest BERTScore, 0.936. The researchers argue that BERTScore is the more reliable metric for the Arabic and Persian pairs, where transliteration variance causes WER to penalize semantically correct transcriptions that use a different script variant. Difficulty-stratified analysis also revealed performance gaps that aggregate averages mask, and BERT embedding projections confirmed that references and hypotheses sit close together semantically despite surface-level script differences. The paper is on arXiv (arXiv:2605.19069) and the full benchmarking dataset is publicly available, giving developers and researchers a resource for evaluating real-world multilingual speech recognition.
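To see why transliteration variance inflates WER, recall that WER is the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. The sketch below is a plain textbook implementation, not the paper's evaluation code, and the example sentence is invented for illustration: the reference keeps "meeting" in Latin script while the hypothesis writes the same word in Arabic script.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Invented example: same meaning, different script for the English word.
reference = "أنا عندي meeting بكرة"
hypothesis = "أنا عندي ميتينج بكرة"
print(wer(reference, hypothesis))  # 0.25
```

One of four words counts as a substitution even though a listener would judge the transcript correct. An embedding-based metric such as BERTScore compares contextual token representations instead of exact strings, which is the property the researchers rely on for the Arabic and Persian pairs.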

Key Points
  • ElevenLabs Scribe v2 achieved the lowest overall WER (13.2%) and the highest BERTScore (0.936), leading on all four code-switching language pairs.
  • A two-stage pipeline, a heuristic filter followed by GPT-4o and Gemini 1.5 Pro ensemble scoring, reduced LLM scoring costs by approximately 91% compared with exhaustive scoring.
  • BERTScore is recommended over WER for the Arabic and Persian pairs, because transliteration variance makes WER penalize semantically correct transcriptions.

Why It Matters

Code-switching is pervasive in real-world speech; better ASR benchmarks drive improvements for multilingual users and global applications.