Research & Papers

New metric and C-3PO alignment fix LLM cultural flip-flopping across languages

LLMs give different answers in English vs. Spanish for the same persona. Here's why.

Deep Dive

Multilingual LLMs (like GPT-4 and LLaMA) often give culturally inconsistent answers when the same user persona is prompted in different languages. For example, with a fixed British persona and a question about famous literature, the English prompt yields Shakespeare, while the Spanish prompt yields Cervantes. This "cross-lingual cultural inconsistency" undermines trust in global AI applications. To quantify the problem, the authors introduce Singleton Fleiss's κ_S, a metric robust to hallucinations, which measures how much a model's persona-adherence changes across languages.
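As a rough illustration of the kind of agreement statistic involved, here is a minimal sketch of the standard Fleiss's κ. It is not the authors' κ_S: this summary does not say how the Singleton variant or its hallucination handling is defined, so neither is reproduced. The sketch assumes each prompt language acts as one "rater" and that each answer has already been labelled with the culture it reflects; the function name and the culture labels are hypothetical.

```python
import math
from collections import Counter


def fleiss_kappa(ratings):
    """Standard Fleiss's kappa (NOT the paper's kappa_S variant).

    ratings: list of items (e.g. persona/question pairs). Each item is a
    list of category labels, one per rater. Assumption for this sketch:
    a rater is a prompt language, a label is the culture of the answer.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    category_totals = Counter()
    agreement_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Fraction of rater pairs that agree on this item.
        agreeing = sum(c * c for c in counts.values()) - n_raters
        agreement_sum += agreeing / (n_raters * (n_raters - 1))

    observed = agreement_sum / n_items
    total = n_items * n_raters
    # Agreement expected by chance from the overall label frequencies.
    expected = sum((c / total) ** 2 for c in category_totals.values())
    if math.isclose(expected, 1.0):
        return float("nan")  # undefined when only one label ever occurs
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels: three prompt languages per item.
consistent = [["UK", "UK", "UK"], ["ES", "ES", "ES"]]
flip_flop = [["UK", "ES"], ["UK", "ES"]]
print(fleiss_kappa(consistent), fleiss_kappa(flip_flop))
```

Under this framing, a value near 1 means the model picks the same culture regardless of prompt language, while values at or below 0 mean agreement is no better than chance.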

To fix this, they propose C-3PO (Cross-lingual Cultural Consistent Preference Optimisation), a consensus-driven alignment framework. C-3PO achieves up to a 0.10 absolute increase in κ_S over unaligned models, outperforming strong prompting and representation-steering baselines. The inconsistency disproportionately affects lower-resource languages such as Indonesian and Persian. Layer-wise interpretability reveals that models implicitly personalize outputs toward the prompt language's stereotypical culture as forward-pass representations stabilize: early decoding of intermediate layers shows the bias emerging before the final output.

Key Points
  • The new metric, Singleton Fleiss's κ_S, quantifies cultural inconsistency across languages and is robust to hallucinations.
  • The C-3PO alignment method improves κ_S by up to 0.10 (absolute) over unaligned models, beating strong prompting and representation-steering baselines.
  • Lower-resource languages such as Indonesian and Persian are disproportionately affected by the inconsistency.

Why It Matters

Multilingual LLMs need to give culturally consistent answers for a given user persona, whatever the prompt language. That consistency is critical for global customer support, education, and public services.