Diagnosing Translated Benchmarks: An Automated Quality Assurance Study of the EU20 Benchmark Suite
Automated audit finds translated benchmarks like HellaSwag are riddled with errors, skewing global AI evaluations.
A new study by researchers Klaudia Thellmann, Bernhard Stadler, and Michael Färber exposes critical quality issues in machine-translated AI benchmarks. The team performed an automated quality assurance audit of the EU20 suite, which comprises five established benchmarks translated into 20 languages. Their three-step method combines a structural corpus audit, quality profiling with the neural metric COMET to compare translation services such as DeepL, ChatGPT, and Google Translate, and an LLM-based analysis that pinpoints error types at the span level.
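To make the COMET-based profiling step concrete, here is a minimal sketch of how a reference-free COMET quality-estimation model can score machine-translated benchmark items using the open-source unbabel-comet package. This is an illustration, not the authors' pipeline: the checkpoint name (Unbabel/wmt22-cometkiwi-da, which may require accepting its license on the Hugging Face Hub), the example sentences, and the aggregation shown are all assumptions.

```python
# Illustrative sketch (not the study's exact pipeline): reference-free COMET
# quality estimation over translated benchmark items with unbabel-comet.
from comet import download_model, load_from_checkpoint

# Assumed checkpoint: a reference-free quality-estimation model, suitable when
# no human reference translation exists for the translated benchmark.
model_path = download_model("Unbabel/wmt22-cometkiwi-da")
model = load_from_checkpoint(model_path)

# Each item pairs the English source ("src") with one service's machine
# translation ("mt"). Example sentences below are made up for illustration.
data = [
    {
        "src": "The ball rolls down the hill and lands in the pond.",
        "mt": "Der Ball rollt den Hügel hinunter und landet im Teich.",
    },
    {
        "src": "She whisks the eggs before folding in the flour.",
        "mt": "Sie schlägt die Eier, bevor sie das Mehl unterhebt.",
    },
]

# predict() returns segment-level scores plus a corpus-level average, which
# can then be aggregated per benchmark or per translation service.
output = model.predict(data, batch_size=8, gpus=0)
print(output.scores)        # one score per translated item
print(output.system_score)  # average over all items
```

Aggregating such segment scores per benchmark and per service is one way to obtain the kind of comparative quality profile the study describes, with lower-scoring datasets flagged for closer review.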
The analysis revealed consistent, troubling trends: benchmarks with lower COMET scores, such as HellaSwag, exhibited a significantly higher share of accuracy and mistranslation errors. In contrast, the ARC dataset was found to be comparatively clean. This variance shows that translation noise is not uniform and can severely distort performance measurements for AI models evaluated across different languages. The work, accepted at LREC 2026, concludes that while automated QA provides scalable indicators to prioritize dataset review, it complements rather than replaces human gold standards.
In response to these findings, the researchers are releasing cleaned and corrected versions of the EU20 datasets alongside their reproducibility code. This practical output provides the community with more reliable tools for multilingual evaluation. The study underscores that the goal is not just to translate benchmarks, but to systematically measure and verify their reliability, ensuring global AI progress is assessed on a level playing field.
- Automated audit found high error rates in translated benchmarks, with HellaSwag notably flawed.
- Used COMET scores and LLM-based error analysis to compare translations from DeepL, ChatGPT, and Google Translate.
- Researchers release corrected EU20 datasets and code to improve multilingual AI evaluation fairness.
Why It Matters
Flawed benchmarks distort global AI model rankings; this work provides tools for more reliable, equitable multilingual evaluation.