New study: AI literary translation trades fluency for faithfulness

arXiv cs.CL May 18, 2026

⚡130K paragraphs from 106 novels in 16 languages reveal a consistent tradeoff...

Deep Dive

A new paper by Sarah Griebel and Ted Underwood, accepted at NLP4DH 2026, examines the tension between fluency and faithfulness in literary translation across human and machine translators. Using a dataset of 130,486 translated paragraphs from 106 novels in 16 source languages, they compared translations by humans, Google Translate, and Google's TranslateGemma model. Fluency was measured with a translationese classifier trained on part-of-speech n-grams, which estimates how 'original-like' a translation reads; faithfulness was assessed with COMET-KIWI, a reference-free automatic evaluation metric. After controlling for paragraph length, the researchers found a consistent negative correlation between fluency and faithfulness for human translators and Google Translate, meaning that more fluent translations tended to be less semantically faithful. For TranslateGemma the tradeoff was weaker and often not statistically significant, suggesting that newer LLM-based systems may handle the balance differently.
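To make the idea of part-of-speech n-gram features concrete, here is a minimal sketch of how such features might be extracted for one paragraph. This summary does not describe the paper's actual feature set, tagger, or classifier, so everything below is an assumption: the function name `pos_ngram_features` is hypothetical, the tags are assumed to come from some off-the-shelf POS tagger, and the choice of n-gram orders 1 to 3 is illustrative only.

```python
from collections import Counter


def pos_ngram_features(pos_tags, n_values=(1, 2, 3)):
    """Return relative frequencies of POS n-grams for one paragraph.

    Hypothetical helper: frequencies are normalized separately for each
    n-gram order, so unigram shares sum to 1, bigram shares sum to 1, etc.
    """
    features = {}
    for n in n_values:
        # Slide a window of size n over the tag sequence.
        grams = [" ".join(pos_tags[i:i + n]) for i in range(len(pos_tags) - n + 1)]
        if not grams:
            continue  # paragraph shorter than n tags
        counts = Counter(grams)
        for gram, count in counts.items():
            features[gram] = count / len(grams)
    return features


# Illustrative tags for "The cat sat on the mat" (assumed tagger output).
tags = ["DET", "NOUN", "VERB", "ADP", "DET", "NOUN"]
features = pos_ngram_features(tags)
```

Feature dictionaries like this could then be fed to any standard classifier trained to separate translated from originally written text; a classifier of that kind never sees the words themselves, only the grammatical patterns.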

The results highlight that segment length is a critical factor in automatic evaluation, as longer paragraphs can distort fluency and faithfulness scores. The study implies that literary translation quality assessment must account for both dimensions separately, and that current metrics may conflate them. For practitioners, this means that relying solely on fluency-based evaluations (like human ratings of 'naturalness') could miss significant faithfulness issues, especially in human and traditional machine translation. The weaker tradeoff in TranslateGemma points to potential advantages of LLM-based translation for literary works, but the paper stops short of declaring a winner. The findings are particularly relevant for publishers, localization teams, and researchers working on literary translation quality assessment.
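One common way to control for a variable such as paragraph length is a partial correlation: regress each score on length and correlate the residuals. The sketch below illustrates that general technique; this summary does not say how the authors implemented their control, so this is not a reconstruction of their method. The function names are hypothetical and the numbers are made-up toy values, not data from the study.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def residuals(y, z):
    """Residuals of a least-squares line fit of y on z (with intercept)."""
    n = len(y)
    mean_z, mean_y = sum(z) / n, sum(y) / n
    slope = (sum((a - mean_z) * (b - mean_y) for a, b in zip(z, y))
             / sum((a - mean_z) ** 2 for a in z))
    return [b - (mean_y + slope * (a - mean_z)) for a, b in zip(z, y)]


def partial_correlation(x, y, control):
    """Correlation of x and y after removing the linear effect of control."""
    return pearson(residuals(x, control), residuals(y, control))


# Toy data (invented for illustration): both scores rise with length,
# but they move in opposite directions once length is held fixed.
length = [1, 2, 3, 4, 5, 6]
noise = [1, -2, 1, 1, -2, 1]
fluency = [l + e for l, e in zip(length, noise)]
faithfulness = [l - e for l, e in zip(length, noise)]

raw_r = pearson(fluency, faithfulness)
partial_r = partial_correlation(fluency, faithfulness, length)
```

In this toy example the raw correlation is positive because both scores depend on length, while the partial correlation is strongly negative, which shows how segment length can mask or distort a tradeoff between the two measures.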

Key Points

Analyzed 130,486 paragraphs from 106 novels in 16 source languages, comparing human, Google Translate, and TranslateGemma translations
Found a consistent negative correlation between fluency (measured by translationese detection) and faithfulness (COMET-KIWI score) for human translations and Google Translate
TranslateGemma showed a weaker, often non-significant tradeoff, suggesting LLM-based systems may balance fluency and faithfulness differently

Why It Matters

Shows that literary translation quality metrics must separate fluency from faithfulness, especially for AI systems.
