Research & Papers

CroCo enables multilingual AI preference tuning without language-specific data

A single English reward model now improves LLM outputs across 14 languages

Deep Dive

Mike Zhang, Ali Basirat, and Desmond Elliott introduce CroCo, which extends contrastive preference tuning to 14 high- and low-resource languages. The method pairs an English-only reward model with multilingual base models (EuroLLM-9B, Aya-3B), so preferences transfer without language-specific annotations. On-policy data is critical; training on off-policy responses reduces the gains. On structured tasks, performance matches or exceeds baselines in most languages, and on open-ended generation the CroCo-tuned models beat their base models in all 11 evaluated languages.

Key Points
  • CroCo uses an English-only reward model on multilingual base models to tune preferences across 14 languages without language-specific annotations.
  • On-policy data is essential: off-policy responses reduce the gains, and online optimization does not outperform the offline variant.
  • On open-ended generation tasks, CroCo-tuned models win against base models in all 11 evaluated languages for both EuroLLM-9B and Aya-3B.
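
To make the on-policy point concrete, here is a minimal sketch of how preference pairs might be built from a model's own samples and a reward model's scores. It is an illustration only, not the authors' pipeline: the function names (`build_preference_pairs`, `toy_sample`, `toy_score`), the best-versus-worst pairing rule, and the length-based toy scorer are all assumptions made for this example.

```python
from typing import Callable, Dict, List


def build_preference_pairs(
    prompts: List[str],
    sample: Callable[[str, int], List[str]],
    score: Callable[[str, str], float],
    num_samples: int = 4,
) -> List[Dict[str, str]]:
    """Turn on-policy samples into (chosen, rejected) pairs.

    `sample` stands in for the model being tuned and `score` for a
    reward model; both are hypothetical placeholders.
    """
    pairs = []
    for prompt in prompts:
        # On-policy: candidates come from the model being tuned,
        # not from a fixed external dataset.
        candidates = sample(prompt, num_samples)
        if len(candidates) < 2:
            continue
        scored = [(score(prompt, c), c) for c in candidates]
        best = max(scored, key=lambda t: t[0])
        worst = min(scored, key=lambda t: t[0])
        if best[0] == worst[0]:
            continue  # no contrast between candidates, skip this prompt
        pairs.append(
            {"prompt": prompt, "chosen": best[1], "rejected": worst[1]}
        )
    return pairs


def toy_sample(prompt: str, n: int) -> List[str]:
    # Stand-in for sampling n responses from the policy model.
    pool = ["ok", "a longer answer", "a much longer, detailed answer", "hm"]
    return pool[:n]


def toy_score(prompt: str, response: str) -> float:
    # Stand-in for a reward model; here longer simply scores higher.
    return float(len(response))


pairs = build_preference_pairs(["Wat is CroCo?"], toy_sample, toy_score)
```

In a real setup, `sample` would call the multilingual base model on prompts in the target language and `score` would call the English-only reward model; the resulting pairs would then feed a contrastive preference objective.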

Why It Matters

CroCo eliminates the need for language-specific preference annotation, making multilingual LLM alignment cheaper and more scalable.