[D] Releasing a professional MQM-annotated MT dataset (16 lang pairs, 48 annotators)
Professional dataset with 362 segments across 16 language pairs achieves 2.6x higher annotator agreement than typical WMT campaigns.
Alconost, a professional translation and localization company, has open-sourced a significant new resource for the machine translation (MT) community. Their 'mqm-translation-gold' dataset, hosted on Hugging Face, addresses a critical gap in the field: the lack of freely available, high-quality evaluation data. Unlike many existing test sets that rely on noisy crowdsourced annotations or are locked behind paywalls, this dataset features annotations from 48 professional linguists. It covers 362 translation segments across 16 language pairs, with each segment annotated using the full Multidimensional Quality Metrics (MQM) framework, which includes error category, severity, and span. The methodology strictly follows WMT (Workshop on Machine Translation) guidelines, ensuring compatibility with established research benchmarks.
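An MQM annotation, as described above, ties an error category and severity to a specific span of the translated text. The record below is a minimal illustrative sketch of that structure; the field names and values are hypothetical, not the dataset's actual schema.

```python
from dataclasses import dataclass

@dataclass
class MQMAnnotation:
    """One MQM-style error annotation on a translation segment.
    Field names are illustrative, not the released dataset's schema."""
    segment_id: int
    annotator_id: int
    category: str          # e.g. "accuracy/mistranslation", "fluency/grammar"
    severity: str          # e.g. "minor", "major", "critical"
    span: tuple[int, int]  # (start, end) character offsets in the target text

# Hypothetical example: a major mistranslation covering characters 12-25
ann = MQMAnnotation(
    segment_id=17,
    annotator_id=3,
    category="accuracy/mistranslation",
    severity="major",
    span=(12, 25),
)
```

Because the release includes multiple annotations per segment, records like these can be grouped by `segment_id` to compare annotators directly.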
The dataset's standout achievement is its high inter-annotator agreement (IAA), a key metric for data reliability. It achieved a Kendall's τ correlation of 0.317, which the creators note is roughly 2.6 times higher than what is typically reported in large-scale WMT evaluation campaigns. They attribute this improvement not to exceptional annotators but to consistent, thorough annotator training. The release includes multiple annotations per segment, enabling robust statistical analysis of agreement. For AI researchers and engineers, this dataset provides a new 'gold standard' for rigorously testing the output quality of models like GPT-4o, Claude 3.5 Sonnet, or Meta's NLLB, moving beyond simplistic metrics like BLEU to detailed error analysis.
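Kendall's τ, the agreement statistic cited above, counts how often two annotators rank pairs of segments the same way. A minimal pure-Python sketch of the τ-a variant follows; the per-segment scores are invented for illustration, and real WMT-style IAA computations involve additional details (tie handling, grouping by segment) not shown here.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant pairs - discordant pairs) / all pairs.
    +1.0 means the two score lists rank every pair identically; -1.0 means
    they rank every pair oppositely. Tied pairs count toward neither."""
    assert len(x) == len(y) and len(x) > 1
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / n_pairs

# Hypothetical per-segment MQM penalty scores from two annotators
annotator_1 = [0, 1, 5, 1, 10, 0]
annotator_2 = [0, 2, 4, 1, 8, 1]
print(f"tau = {kendall_tau(annotator_1, annotator_2):.3f}")
```

Against this scale, 0.317 is modest in absolute terms, which is why the comparison to typical WMT campaign scores (rather than to 1.0) is the meaningful claim.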
- Dataset contains 362 segments across 16 language pairs, annotated by 48 professional linguists using the MQM framework.
- Achieved a Kendall's τ score of 0.317 for inter-annotator agreement, ~2.6x higher than typical WMT campaign results.
- Open-sourced on Hugging Face to provide a free, high-quality benchmark for evaluating machine translation model accuracy.
Why It Matters
Provides a free, reliable benchmark for AI translation quality, enabling better model evaluation and development beyond basic automated scores.