Benchmark reveals ElevenLabs Scribe v2 dominates code-switching ASR at 13.2% WER

arXiv cs.CL May 20, 2026

⚡Five commercial ASR systems tested on Arabic, Persian, and German code-switching speech.

Deep Dive

A new paper from Sajjad Abdoli, Ghassan Al-Sumaidaee, and colleagues presents the first comprehensive benchmark of commercial automatic speech recognition (ASR) systems on code-switching speech, in which speakers alternate between two languages within a single utterance. The study evaluates five commercial ASR providers across four language pairs: Egyptian Arabic–English, Saudi Arabic (Najdi/Hijazi)–English, Persian (Farsi)–English, and German–English. Each dataset contains 300 samples selected by a two-stage pipeline: a heuristic filter first scores transcripts on five structural code-switching signals, and an ensemble of GPT-4o and Gemini 1.5 Pro then scores the remaining candidates across six linguistic dimensions. Because only the filtered candidates reach the LLMs, the pipeline cut LLM scoring costs by approximately 91% compared with scoring every candidate, which lowers the cost of building large-scale code-switching benchmarks.
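The two-stage selection described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the article does not name the five structural signals, so `switch_score` is a hypothetical stand-in that counts changes between ASCII and non-ASCII tokens (it would not separate German from English), `llm_score` is any caller-supplied function standing in for the GPT-4o and Gemini 1.5 Pro ensemble, and cost is assumed to be proportional to the number of candidates sent to the LLMs.

```python
from typing import Callable, List, Tuple


def is_ascii_token(token: str) -> bool:
    """True if the token has letters and all of them are ASCII (rough Latin-script check)."""
    letters = [ch for ch in token if ch.isalpha()]
    return bool(letters) and all(ch.isascii() for ch in letters)


def switch_score(transcript: str) -> float:
    """Hypothetical structural signal: share of adjacent token pairs that change script."""
    tokens = [t for t in transcript.split() if any(ch.isalpha() for ch in t)]
    if len(tokens) < 2:
        return 0.0
    flags = [is_ascii_token(t) for t in tokens]
    switches = sum(1 for a, b in zip(flags, flags[1:]) if a != b)
    return switches / (len(tokens) - 1)


def two_stage_select(
    transcripts: List[str],
    llm_score: Callable[[str], float],  # stand-in for the expensive LLM ensemble
    keep_fraction: float,               # assumed share of candidates passing stage 1
    top_k: int,
) -> Tuple[List[str], float]:
    """Return the top_k transcripts and the fraction of LLM calls avoided."""
    if not transcripts:
        raise ValueError("transcripts must not be empty")
    # Stage 1: cheap heuristic ranking over every candidate.
    ranked = sorted(transcripts, key=switch_score, reverse=True)
    n_keep = max(1, round(len(ranked) * keep_fraction))
    shortlist = ranked[:n_keep]
    # Stage 2: expensive scoring only on the shortlist.
    scored = sorted(shortlist, key=llm_score, reverse=True)
    saving = 1 - n_keep / len(transcripts)
    return scored[:top_k], saving
```

Under the proportional-cost assumption, a `keep_fraction` of 0.09 would correspond to a saving of about 91%. The article does not say how the reported figure was actually derived, so that mapping is illustrative only.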

ElevenLabs Scribe v2 led on all four language pairs, with a 13.2% overall Word Error Rate (WER), including 13.1% on Egyptian Arabic, and the highest BERTScore at 0.936. The researchers argue that BERTScore is the more reliable metric for the Arabic and Persian pairs, where transliteration variance causes WER to penalize semantically correct transcriptions that merely use a different script variant. A difficulty-stratified analysis revealed performance gaps that aggregate averages mask, and BERT embedding projections confirmed that references and hypotheses remain semantically close despite surface-level script differences. The paper is available on arXiv (arXiv:2605.19069), and the full benchmarking dataset is publicly available, giving developers and researchers a resource for evaluating multilingual speech recognition on real-world speech.
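To see why WER can penalize a transcription that differs only in script, here is a minimal word-level WER sketch (substitutions, deletions, and insertions divided by the number of reference words). It uses plain whitespace tokenization with no text normalization, which is an assumption and may differ from the paper's scoring setup; the example sentence is made up for illustration and is not taken from the benchmark.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Illustrative (made-up) Arabic–English utterance: the English word is written
# in Latin script in the reference and in Arabic script in the hypothesis.
reference = "عندي meeting بكرة"
hypothesis = "عندي ميتنج بكرة"

print(word_error_rate(reference, reference))             # 0.0
print(round(word_error_rate(reference, hypothesis), 3))  # 0.333
```

The second hypothesis conveys the same words, yet the script difference counts as one substitution out of three reference words. An embedding-based metric such as BERTScore compares contextual embeddings rather than exact strings, which is why the researchers consider it less sensitive to this kind of variance.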

Key Points

ElevenLabs Scribe v2 achieved the lowest overall WER (13.2%) and the highest BERTScore (0.936) across all four code-switching language pairs.
A two-stage selection pipeline (a heuristic filter, then a GPT-4o and Gemini 1.5 Pro ensemble) reduced LLM scoring costs by about 91% compared with exhaustive scoring.
BERTScore is recommended over WER for Arabic and Persian, where transliteration variance penalizes semantically correct transcriptions.

Why It Matters

Code-switching is pervasive in real-world speech; better ASR benchmarks drive improvements for multilingual users and global applications.
