We benchmarked TranslateGemma against five other LLMs on subtitle translation across six languages. At first glance the numbers told a clean story; then human QA added a chapter.
The model topped benchmarks across six languages, but linguists discovered it was ignoring Traditional Chinese requests.
A comprehensive benchmark pitted six leading language models against each other on a practical task: translating English subtitles into Spanish, Japanese, Korean, Thai, and both Simplified and Traditional Chinese. The contenders were Google's TranslateGemma-12b, Anthropic's Claude Sonnet 4.6, DeepSeek-V3.2, Google's Gemini 3.1 Flash Lite, and OpenAI's GPT-5.4-mini and GPT-5.4-nano. Scored with a custom Translation Quality Index (TQI) that combines the COMETKiwi fluency metric and the MetricX-24 fidelity score, TranslateGemma-12b emerged as the clear winner with an average TQI of 0.6335, ahead of Gemini 3.1 Flash Lite (0.5981) and the GPT-5.4 variants.
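The article does not publish the exact TQI formula, so the sketch below is a hypothetical illustration of how the two signals could be blended. COMETKiwi produces a roughly 0-1 quality score where higher is better, while MetricX-24 produces a 0-25 error score where lower is better, so the error score has to be inverted before combining; the equal weighting is an assumption.

```python
# Hypothetical sketch of a Translation Quality Index (TQI). The benchmark's
# exact formula is not published; the normalization and 50/50 weighting here
# are assumptions for illustration only.

def tqi(cometkiwi_score: float, metricx_error: float, weight: float = 0.5) -> float:
    """Blend a COMETKiwi quality score (0-1, higher is better) with a
    MetricX-24 error score (0-25, lower is better) into a single index."""
    # Invert and rescale MetricX-24 so that 1.0 means no detected errors.
    metricx_quality = 1.0 - min(max(metricx_error, 0.0), 25.0) / 25.0
    # Weighted average of the two quality signals.
    return weight * cometkiwi_score + (1.0 - weight) * metricx_quality

# Example: a strong COMETKiwi score paired with a moderate MetricX-24 error.
print(round(tqi(0.82, 6.0), 4))  # -> 0.79
```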
However, the story took a dramatic turn during human quality assurance. Linguists reviewing the Traditional Chinese (zh-TW) output discovered that TranslateGemma was systematically ignoring the locale tag and producing Simplified Chinese instead. A follow-up test using the explicit 'zh-Hant' tag still left 76% of segments incorrectly Simplified. Crucially, both automated metrics, MetricX-24 (itself a Google metric) and COMETKiwi, scored the wrong-script output just as highly as correct output, revealing a complete blind spot. The failure traces to a confirmed bias in TranslateGemma's fine-tuning corpus, which is heavily skewed toward Simplified Chinese data. The benchmark highlights the persistent gap between automated scores and real-world usability, especially for nuanced localization tasks where cultural and script fidelity is paramount.
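A failure like this is cheap to catch mechanically once you know to look for it. The sketch below shows one way a QA pipeline could flag wrong-script segments, using the open-source OpenCC converter as a round-trip heuristic; the opencc package, the threshold, and the heuristic itself are illustrative assumptions, not the reviewers' actual tooling.

```python
# Minimal sketch of a script check that would flag the zh-Hant failure.
# Uses the OpenCC converter (pip install opencc); the threshold and the
# round-trip heuristic are assumptions, not the QA team's actual pipeline.
from opencc import OpenCC

to_traditional = OpenCC("s2t")  # Simplified -> Traditional converter

def looks_simplified(segment: str, threshold: float = 0.05) -> bool:
    """Heuristic: if converting Simplified -> Traditional changes many
    characters, the segment was probably Simplified to begin with."""
    converted = to_traditional.convert(segment)
    changed = sum(1 for a, b in zip(segment, converted) if a != b)
    return len(segment) > 0 and changed / len(segment) >= threshold

# Flag the share of subtitle segments that came back in the wrong script.
segments = ["这是简体中文字幕", "這是繁體中文字幕"]  # example inputs
flagged = [s for s in segments if looks_simplified(s)]
print(f"{len(flagged)}/{len(segments)} segments appear Simplified")
```

Run over the benchmark's zh-Hant outputs, a check like this would have surfaced the 76% failure rate without waiting for human review.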
- TranslateGemma-12b ranked #1 in a six-language translation test with a TQI score of 0.6335, beating models like GPT-5.4-mini and Claude Sonnet 4.6.
- Human QA uncovered a major flaw: the model output Simplified Chinese for 76% of Traditional Chinese requests, a failure missed by automated metrics.
- The incident exposes a critical training data bias and underscores the necessity of human evaluation alongside benchmark scores for real-world deployment.
Why It Matters
This case shows that automated benchmarks alone are insufficient for evaluating AI; human review remains essential to catch critical real-world failures.