TranslateGemma subtitles pass AI metrics but fail human review: 71% error blindness
AI translation metrics rated 84 subtitles 'clean', but humans found errors in 71% of them.
A recent benchmark comparing six LLMs on subtitle translation used two reference-free quality-estimation (QE) metrics, MetricX-24 (~13B mT5-XXL) and COMETKiwi (~10.7B XLM-R-XXL), combined into a TQI index. TranslateGemma-12B ranked first across all language pairs. But the benchmark's creators questioned whether the metrics remain sensitive in the high-confidence zone, where both already rate a translation as good. They ran a human review of 84 translations (21 English subtitles, each translated into Spanish, Japanese, Thai, and Chinese) that passed both metrics' clean thresholds (MX < 5 AND CK ≥ 0.70).
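The clean-zone gate described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's own code: the function name, constant names, and the sample scores are all hypothetical, and only the two thresholds (MX < 5, CK ≥ 0.70) come from the article.

```python
# Thresholds as reported in the article: MX < 5 AND CK >= 0.70.
MX_CLEAN_MAX = 5.0   # MetricX-24 is an error score: lower is better
CK_CLEAN_MIN = 0.70  # COMETKiwi is a quality score: higher is better


def is_metric_clean(mx: float, ck: float) -> bool:
    """Return True when a segment passes both clean thresholds."""
    return mx < MX_CLEAN_MAX and ck >= CK_CLEAN_MIN


# Hypothetical scores, for illustration only (not taken from the study).
segments = [
    {"id": "a", "mx": 1.2, "ck": 0.86},
    {"id": "b", "mx": 5.0, "ck": 0.91},  # fails: MX is not strictly below 5
    {"id": "c", "mx": 2.4, "ck": 0.69},  # fails: CK is below 0.70
]

clean_ids = [s["id"] for s in segments if is_metric_clean(s["mx"], s["ck"])]
print(clean_ids)  # ['a']
```

Because the gate is a logical AND, a segment has to look good to both metrics to land in the reviewed set, which is what makes the human-found errors in that set notable.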
Professional linguists applied full MQM annotation (Major/Minor severity, covering accuracy, fluency, style, and terminology). The results: only 1 translation was auto-flagged by the metrics, while 60 of 84 had at least one human-flagged error (13 major-only). The metric-blindness rate was 71% for any error and 14.5% for major errors. All 25 accuracy-class errors fell in the blind zone. Notably, Japanese accounted for 10 of the 15 mistranslations, all invisible to the metrics, despite having the highest mean COMETKiwi score (0.863). The study is small (one model, one content set), but the numbers are a stark warning: automated metrics can miss the majority of real-world flaws, especially in high-stakes domains like subtitling.
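As a quick check on the headline figure, the any-error blindness rate is just the share of metric-clean translations in which reviewers still found an error. The sketch below uses only counts quoted in the article (60 of 84, and 0 of 25 accuracy errors detected); the function names are hypothetical, and the article does not spell out the exact denominator behind its major-error rate, so that figure is not recomputed here.

```python
def blindness_rate(human_flagged: int, metric_clean: int) -> float:
    """Share of metric-clean translations that humans still flagged."""
    if metric_clean <= 0:
        raise ValueError("metric_clean must be positive")
    return human_flagged / metric_clean


def detection_rate(detected: int, total_errors: int) -> float:
    """Share of human-annotated errors that the metrics also caught."""
    if total_errors <= 0:
        raise ValueError("total_errors must be positive")
    return detected / total_errors


# Counts quoted in the article.
any_error_blindness = blindness_rate(human_flagged=60, metric_clean=84)
accuracy_detection = detection_rate(detected=0, total_errors=25)

print(f"any-error blindness: {any_error_blindness:.1%}")   # 71.4%
print(f"accuracy-error detection: {accuracy_detection:.0%}")  # 0%
```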
- TranslateGemma-12B produced 84 translations that passed AI quality thresholds, but human review found errors in 71% (60/84) of them.
- All 25 accuracy-class errors (mistranslation, omission, addition) were missed by both MetricX-24 and COMETKiwi (0% detection).
- Japanese had the highest mean COMETKiwi score (0.863) yet carried 10 of 15 mistranslations, all of them metric-blind.
Why It Matters
Automated translation metrics can systematically miss real errors, so professionals who rely on them alone risk shipping flawed subtitles at scale.