Diagnosing Translated Benchmarks: An Automated Quality Assurance Study of the EU20 Benchmark Suite
Automated audit finds translated benchmarks like HellaSwag are riddled with errors, skewing global AI evaluations.
A new study by researchers Klaudia Thellmann, Bernhard Stadler, and Michael Färber exposes critical quality issues in machine-translated AI benchmarks. The team performed an automated quality assurance audit of the EU20 suite, which comprises five established benchmarks translated into 20 languages. Their three-step method combines a structural corpus audit, quality profiling with the neural metric COMET to compare translation services such as DeepL, ChatGPT, and Google Translate, and an LLM-based analysis that pinpoints error types at the span level.
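To make the COMET-based profiling step concrete, here is a minimal sketch of how a reference-free COMET quality-estimation model can score machine-translated benchmark items using the open-source unbabel-comet package. This is an illustration, not the authors' pipeline: the checkpoint name (Unbabel/wmt22-cometkiwi-da, which may require accepting its license on the Hugging Face Hub), the example sentences, and the aggregation shown are all assumptions.

```python
# Illustrative sketch (not the study's exact pipeline): reference-free COMET
# quality estimation over translated benchmark items with unbabel-comet.
from comet import download_model, load_from_checkpoint

# Assumed checkpoint: a reference-free quality-estimation model, suitable when
# no human reference translation exists for the translated benchmark.
model_path = download_model("Unbabel/wmt22-cometkiwi-da")
model = load_from_checkpoint(model_path)

# Each item pairs the English source ("src") with one service's machine
# translation ("mt"). Example sentences below are made up for illustration.
data = [
    {
        "src": "The ball rolls down the hill and lands in the pond.",
        "mt": "Der Ball rollt den Hügel hinunter und landet im Teich.",
    },
    {
        "src": "She whisks the eggs before folding in the flour.",
        "mt": "Sie schlägt die Eier, bevor sie das Mehl unterhebt.",
    },
]

# predict() returns segment-level scores plus a corpus-level average, which
# can then be aggregated per benchmark or per translation service.
output = model.predict(data, batch_size=8, gpus=0)
print(output.scores)        # one score per translated item
print(output.system_score)  # average over all items
```

Aggregating such segment scores per benchmark and per service is one way to obtain the kind of comparative quality profile the study describes, with lower-scoring datasets flagged for closer review.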
The analysis revealed consistent, troubling trends: benchmarks with lower COMET scores, such as HellaSwag, exhibited a significantly higher share of accuracy and mistranslation errors. In contrast, the ARC dataset was found to be comparatively clean. This variance shows that translation noise is not uniform and can severely distort performance measurements for AI models evaluated across different languages. The work, accepted at LREC 2026, concludes that while automated QA provides scalable indicators to prioritize dataset review, it complements rather than replaces human gold standards.
In response to these findings, the researchers are releasing cleaned and corrected versions of the EU20 datasets alongside their reproducibility code. This practical output provides the community with more reliable tools for multilingual evaluation. The study underscores that the goal is not just to translate benchmarks, but to systematically measure and verify their reliability, ensuring global AI progress is assessed on a level playing field.
- Automated audit found high error rates in translated benchmarks, with HellaSwag notably flawed.
- Used COMET scores and LLM-based error analysis to compare translations from DeepL, ChatGPT, and Google Translate.
- Researchers release corrected EU20 datasets and code to improve multilingual AI evaluation fairness.
Why It Matters
Flawed benchmarks distort global AI model rankings; this work provides tools for more reliable, equitable multilingual evaluation.