Research & Papers

Multilingual Language Models Encode Script Over Linguistic Structure

Multilingual AI models organize knowledge by writing system first, not language family, challenging assumptions about how they understand text.

Deep Dive

A new study accepted at ACL 2026 reveals that popular multilingual language models like Meta's Llama-3.2-1B and Google's Gemma-2-2B organize their internal representations primarily by the surface script (orthography) of text, not by its deeper linguistic structure or language family. Researchers from IIT Delhi and IIIT Delhi used techniques such as Language Activation Probability Entropy (LAPE) and Sparse Autoencoders to analyze these compact, distilled models, where representational trade-offs are more explicit. They found that romanizing a language (writing it in the Latin alphabet) produces representations nearly disjoint from both the native script and English, showing the models' strong bias toward visual form.
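To make the LAPE idea concrete, here is a minimal, illustrative sketch. It assumes we have already measured, for each neuron, the fraction of tokens in each language on which that neuron fires; the array names and toy numbers below are invented for illustration, not taken from the paper.

```python
import numpy as np

def lape(act_prob: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """act_prob: (num_neurons, num_languages) activation probabilities.

    Returns one entropy score per neuron. LOW entropy means the neuron
    fires mostly for one language (or one script) and is therefore
    language-specific; HIGH entropy means it is shared across languages.
    """
    # Normalize each neuron's probabilities into a distribution over languages.
    dist = act_prob / (act_prob.sum(axis=1, keepdims=True) + eps)
    # Shannon entropy of that distribution, per neuron.
    return -(dist * np.log(dist + eps)).sum(axis=1)

# Toy example: neuron 0 fires almost only for language 0 (script-specific),
# neuron 1 fires equally for all three languages (language-agnostic).
probs = np.array([[0.9, 0.0, 0.0],
                  [0.3, 0.3, 0.3]])
scores = lape(probs)
# The language-specific neuron gets the lower entropy score.
assert scores[0] < scores[1]
```

Ranking neurons by this score and keeping the low-entropy ones is one way to isolate the script-aligned units the article describes.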

The research further probed how these models handle linguistic abstraction. While word-order shuffling had limited effect on how the model identified language units, deeper layers of the neural network did become increasingly capable of accessing typological structure. Crucially, causal intervention experiments showed that the model's text generation was most sensitive to internal units that remained stable despite surface-form changes, not to units identified by typological alignment alone. This indicates that linguistic abstraction in these models emerges gradually through the layers, without collapsing all languages into a single, unified representation space (an 'interlingua'). The findings challenge the assumption that multilingual LLMs build a deep, language-agnostic understanding, highlighting instead their heavy reliance on orthographic cues.
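A layer-by-layer comparison of the kind described above can be sketched as follows: mean-pool a sentence's hidden states at each layer for its native-script and romanized forms, then measure their cosine similarity per layer. The arrays here are random stand-ins; a real probe would extract hidden states from the model (e.g. via a forward pass that returns all layer outputs), and the function names are my own.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def layerwise_similarity(native: list[np.ndarray],
                         roman: list[np.ndarray]) -> list[float]:
    """native/roman: per-layer token-state matrices of shape (tokens, dim).

    Mean-pools over tokens at each layer, then compares the pooled
    vectors. A script-biased model would show low similarity in early
    layers even for the same underlying sentence, with any convergence
    (partial abstraction) emerging only in deeper layers.
    """
    return [cosine(n.mean(axis=0), r.mean(axis=0))
            for n, r in zip(native, roman)]

# Toy data: 4 layers of random hidden states for the two surface forms.
rng = np.random.default_rng(0)
native_states = [rng.standard_normal((5, 16)) for _ in range(4)]
roman_states = [rng.standard_normal((7, 16)) for _ in range(4)]
sims = layerwise_similarity(native_states, roman_states)
assert len(sims) == 4 and all(-1.0 <= s <= 1.0 for s in sims)
```

Under the paper's finding, such a curve would stay low rather than converge to 1.0 at the top layers, consistent with abstraction emerging gradually without a single shared interlingua.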

Key Points
  • Llama-3.2-1B and Gemma-2-2B organize languages by script (e.g., Devanagari, Latin) first, not by linguistic family, creating disjoint representations for romanized text.
  • Deeper neural network layers show increasing access to typological structure, but generation depends on units stable across surface changes, not pure typology.
  • The study suggests multilingual models do not create a single 'interlingua' but instead build abstraction gradually from a script-based foundation.

Why It Matters

These findings affect how we design and evaluate multilingual AI: script normalization (such as romanization) can disrupt a model's internal representations rather than aid cross-lingual transfer.