Research & Papers

Cross-Lingual Response Consistency in Large Language Models: An ILR-Informed Evaluation of Claude Across Six Languages

French outputs run roughly 30% longer than German ones; creative tasks diverge most across languages.

Deep Dive

A new paper on arXiv (arXiv:2604.27137 [cs.CL]) introduces a systematic evaluation framework grounded in the Interagency Language Roundtable (ILR) Skill Level Descriptions and applies it to Anthropic's Claude Sonnet 4.6 across six languages: English, French, Romanian, Spanish, Italian, and German. The study uses 12 semantically equivalent prompt clusters spanning ILR complexity levels 1 through 3+, collecting 216 responses (12 prompts × 6 languages × 3 runs). The analysis combines automated quantitative metrics with expert qualitative ILR assessment by a six-language professional with 12 years of Oral Proficiency Interview (OPI) experience.
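For concreteness, here is a minimal sketch of how the 216-response collection matrix could be enumerated. The `query_model` stub, the language codes, and the loop structure are illustrative assumptions, not the paper's published harness.

```python
from itertools import product

# Sketch of the collection design: 12 prompt clusters x 6 languages
# x 3 runs = 216 responses. Names here are assumptions for illustration.
LANGUAGES = ["en", "fr", "ro", "es", "it", "de"]
N_CLUSTERS = 12
N_RUNS = 3

def query_model(cluster: int, lang: str, run: int) -> str:
    # Placeholder: a real harness would send the language-specific rendering
    # of prompt cluster `cluster` to the model API and return the response text.
    return f"response(cluster={cluster}, lang={lang}, run={run})"

responses = {
    (cluster, lang, run): query_model(cluster, lang, run)
    for cluster, lang, run in product(range(N_CLUSTERS), LANGUAGES, range(N_RUNS))
}

assert len(responses) == 12 * 6 * 3  # 216 responses total
```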

Key results show that French responses run approximately 30% longer than German responses on identical prompts, with the creative and affective clusters showing the greatest cross-lingual surface divergence (a sketch of such a length comparison follows below). The qualitative analysis identifies five variation patterns: systematic differences in pragmatic disambiguation strategies; aesthetic and literary-tradition divergence in creative output; language-internal technical terminology norms; cultural calibration gaps, where culture-specific content gives way to neutralized templates; and language-specific institutional referral behavior in emotional-support responses. The authors argue that ILR-informed expert judgment complements purely computational benchmarks, and that cross-lingual variation in Claude is interpretable, domain-dependent, and consequential for equitable multilingual AI deployment.
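The headline length comparison is straightforward to reproduce in principle. The sketch below computes the French-to-German length gap per cluster, plus a coefficient-of-variation proxy for cross-lingual surface divergence. The token counts are invented for illustration (they are not the paper's data), and the CV metric is an assumption, since the paper's exact divergence measure is not given here.

```python
from statistics import mean, stdev

# Illustrative token counts only; NOT the paper's data. Chosen so the
# creative cluster reproduces the reported ~30% French/German gap.
token_counts = {
    "creative":  {"en": 410, "fr": 520, "ro": 450, "es": 480, "it": 470, "de": 400},
    "technical": {"en": 300, "fr": 330, "ro": 310, "es": 320, "it": 315, "de": 305},
}

def fr_de_gap(counts: dict) -> float:
    # Relative length difference of French vs. German responses.
    return counts["fr"] / counts["de"] - 1.0

def cross_lingual_cv(counts: dict) -> float:
    # Coefficient of variation across the six languages: a crude proxy for
    # "surface divergence" (an assumption, not necessarily the paper's metric).
    values = list(counts.values())
    return stdev(values) / mean(values)

for cluster, counts in token_counts.items():
    print(f"{cluster:9s} FR vs DE: {fr_de_gap(counts):+.0%}  CV: {cross_lingual_cv(counts):.3f}")
```

On these invented numbers, the creative cluster shows a +30% French/German gap and a markedly higher cross-lingual CV than the technical cluster, mirroring the pattern the paper reports.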

Key Points
  • French responses are approximately 30% longer than German responses on identical prompts in Claude Sonnet 4.6.
  • Creative and affective prompts show the highest cross-lingual divergence, exposing aesthetic and cultural calibration gaps.
  • Five variation patterns identified: pragmatic disambiguation, literary tradition, terminology norms, cultural neutralization, and institutional referral in emotional support.

Why It Matters

Reveals systematic, language-dependent biases in LLM behavior, which is crucial for equitable deployment in multilingual enterprise settings and global services.