New study: literary translation trades fluency for faithfulness, less so for an LLM-based system
130K paragraphs from 106 novels in 16 languages reveal a consistent tradeoff...
A new paper by Sarah Griebel and Ted Underwood, accepted at NLP4DH 2026, examines the tension between fluency and faithfulness in literary translation across human and machine translators. Using a dataset of 130,486 translated paragraphs from 106 novels in 16 source languages, they compared translations by humans, Google Translate, and Google's TranslateGemma model. Fluency was measured with a translationese classifier trained on part-of-speech n-grams, which scores how 'original-like' a translation reads; faithfulness was assessed with COMET-KIWI, a reference-free automatic evaluation metric. After controlling for paragraph length, the researchers found a consistent negative correlation between fluency and faithfulness for human translators and Google Translate: more fluent translations tended to be less semantically accurate. For TranslateGemma the tradeoff was weaker and often not statistically significant, suggesting that newer LLM-based systems may handle the balance differently.
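To make the fluency measure more concrete, here is a minimal sketch of how part-of-speech n-gram features could be extracted from a paragraph before being passed to a classifier. It is an illustration under assumptions, not the authors' pipeline: the summary does not say which tagger, n-gram order, or classifier the paper uses, so the function name `pos_ngram_features`, the choice of trigrams, and the hand-tagged example are all hypothetical.

```python
from collections import Counter


def pos_ngram_features(tags, n=3):
    """Return the relative frequency of each POS n-gram in one tagged paragraph.

    `tags` is a list of part-of-speech labels, one per token. The tagger that
    produces them is out of scope here (assumption: tags are already available).
    """
    grams = [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]
    total = len(grams)
    if total == 0:
        # Paragraph shorter than n tokens: no n-grams to count.
        return {}
    counts = Counter(grams)
    return {gram: count / total for gram, count in counts.items()}


# Hypothetical hand-tagged paragraph (Universal POS style labels), for illustration only.
tags = ["DET", "NOUN", "VERB", "DET", "NOUN", "VERB", "DET", "NOUN"]
features = pos_ngram_features(tags, n=3)

for gram, freq in sorted(features.items()):
    print(gram, round(freq, 3))
```

Because the features describe only grammatical patterns and not the words themselves, a classifier trained on them can judge how closely a translation resembles original writing without relying on topic or vocabulary.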
The results highlight that segment length is a critical factor in automatic evaluation, as longer paragraphs can distort fluency and faithfulness scores. The study implies that literary translation quality assessment must account for both dimensions separately, and that current metrics may conflate them. For practitioners, this means that relying solely on fluency-based evaluations (like human ratings of 'naturalness') could miss significant faithfulness issues, especially in human and traditional machine translation. The weaker tradeoff in TranslateGemma points to potential advantages of LLM-based translation for literary works, but the paper stops short of declaring a winner. The findings are particularly relevant for publishers, localization teams, and researchers working on literary translation quality assessment.
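The role of segment length can be illustrated with a partial correlation, which measures the relationship between two scores after removing what each shares with a third variable. The sketch below uses the standard first-order partial correlation formula on invented numbers; the summary does not state which statistical procedure the paper applies, so treat the method, the function names, and the toy data as assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def partial_corr(x, y, z):
    """Correlation of x and y after controlling for z (first-order formula)."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Synthetic data, NOT from the paper: both scores rise with paragraph length,
# but at any given length they move in opposite directions.
length = [1, 2, 3, 4, 5, 6]
offset = [0.5, -0.5, 0.5, -0.5, 0.5, -0.5]
fluency = [l + o for l, o in zip(length, offset)]
faithfulness = [l - o for l, o in zip(length, offset)]

raw = pearson(fluency, faithfulness)
controlled = partial_corr(fluency, faithfulness, length)
print(f"raw correlation:        {raw:.3f}")
print(f"controlled for length:  {controlled:.3f}")
```

In this constructed case the raw correlation is positive while the length-controlled correlation is negative, which shows how an uncontrolled length effect could hide a tradeoff between the two scores.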
- Analyzed 130,486 paragraphs from 106 novels in 16 source languages, comparing human, Google Translate, and TranslateGemma translations
- Found a consistent negative correlation between fluency (measured by translationese detection) and faithfulness (COMET-KIWI score) for human translators and Google Translate
- TranslateGemma showed a weaker, often non-significant tradeoff, suggesting LLM-based systems may balance fluency and faithfulness differently
Why It Matters
Shows that literary translation quality assessment should measure fluency and faithfulness separately, since the tradeoff appears in human and Google Translate output and is weaker in TranslateGemma.