The Impact of Vocabulary Overlaps on Knowledge Transfer in Multilingual Machine Translation
New research reveals what truly drives knowledge transfer in multilingual translation models.
A new study from the University of Helsinki sheds light on a long-debated question in multilingual neural machine translation (MNMT): how much does vocabulary overlap actually matter for knowledge transfer? Oona Itkonen and Jörg Tiedemann designed systematic experiments comparing joint vocabularies (shared token sets across languages) against disjoint vocabularies (separate for each language), while also varying whether the auxiliary language was related or unrelated to the source language. To emphasize transfer effects, they used an out-of-domain setup where the training data for the auxiliary language came from a different domain than the source-target pair.
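To make the joint/disjoint distinction concrete, here is a minimal sketch of how the overlap between two vocabularies might be measured. It is not the authors' setup: the whitespace tokenizer, the language-tag prefix used to force disjoint vocabularies, the Jaccard measure, and the toy sentences are all assumptions for illustration. A real MNMT system would typically use a subword tokenizer instead.

```python
def build_vocab(sentences, lang_tag=None):
    """Collect whitespace tokens from sentences.

    If lang_tag is given, every token is prefixed with it, so two
    languages can never share an entry (an assumed way to emulate a
    disjoint vocabulary).
    """
    vocab = set()
    for sentence in sentences:
        for token in sentence.split():
            vocab.add(f"{lang_tag}:{token}" if lang_tag else token)
    return vocab


def jaccard_overlap(vocab_a, vocab_b):
    """Share of entries common to both vocabularies (0.0 to 1.0)."""
    union = vocab_a | vocab_b
    return len(vocab_a & vocab_b) / len(union) if union else 0.0


# Toy data standing in for a source language and an auxiliary language.
source_sentences = ["talo on suuri"]
auxiliary_sentences = ["maja on suur"]

# Joint setting: identical surface tokens map to the same vocabulary entry.
joint_overlap = jaccard_overlap(
    build_vocab(source_sentences),
    build_vocab(auxiliary_sentences),
)

# Disjoint setting: language tags keep the two token sets fully separate.
disjoint_overlap = jaccard_overlap(
    build_vocab(source_sentences, lang_tag="src"),
    build_vocab(auxiliary_sentences, lang_tag="aux"),
)

print(joint_overlap, disjoint_overlap)
```

In the joint setting the two toy sentences share one token out of five distinct ones, while the tagged variant shares none. The same measurement could be run on real training corpora to see how much overlap a candidate auxiliary language actually contributes.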
The results confirm that the larger vocabulary overlaps typical of related languages do boost performance. However, the study's key finding is that a domain match between the auxiliary and source-target language pairs, along with language relatedness, matters more than a shared vocabulary. In some cases, using a disjoint vocabulary with a related language in the same domain outperformed a joint vocabulary with an unrelated language. This suggests that MNMT practitioners should prioritize auxiliary languages that are closely related to the source language and close in domain to the target scenario, rather than focusing on token sharing across vocabularies. The paper is available on arXiv (2605.04196).
- Systematic experiments compared joint vs. disjoint vocabularies with related and unrelated auxiliary languages in an out-of-domain MNMT setup.
- Vocabulary overlaps typical of related languages improved results, but domain match and language relatedness proved more important.
- Disjoint vocabulary with a related, domain-matched language can outperform joint vocabulary with an unrelated language.
Why It Matters
When designing MNMT models, prioritize language relatedness and domain alignment over shared-vocabulary engineering.