Morphemes Without Borders: Evaluating Root-Pattern Morphology in Arabic Tokenizers and LLMs
Research on seven Arabic-centric and multilingual LLMs reveals that tokenizer design doesn't predict morphological generation quality.
A team of researchers from United Arab Emirates University, led by Yara Alakeel, published a study titled 'Morphemes Without Borders' that investigates how well large language models handle Arabic's distinctive morphological system. Arabic uses a root-pattern structure in which a consonantal root (such as k-t-b, associated with writing) is interleaved with vowel-and-affix patterns to form words: kitab 'book', maktab 'office', kataba 'he wrote'. The team evaluated seven Arabic-centric and multilingual LLMs and their tokenizers against a gold-standard morphological segmentation, probing whether they capture this underlying structure or merely memorize surface forms.
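To make the root-pattern idea concrete, here is a minimal sketch of templatic word formation. The placeholder notation and the small pattern inventory are illustrative assumptions for this sketch, not the paper's data or a full phonological model:

```python
# Minimal illustration of Arabic root-pattern (templatic) morphology.
# Patterns use the placeholders 1, 2, 3 for the root consonants;
# transliterations are simplified (vowel length is omitted).

def apply_pattern(root: str, pattern: str) -> str:
    """Interleave a triliteral root, e.g. "ktb", into a template, e.g. "1i2a3"."""
    c1, c2, c3 = root
    return pattern.replace("1", c1).replace("2", c2).replace("3", c3)

PATTERNS = {
    "1i2a3": "noun ('book')",
    "ma12a3": "noun of place ('office')",
    "1a2a3a": "perfective verb, 3sg.m ('he wrote')",
}

for pattern, gloss in PATTERNS.items():
    print(f"k-t-b + {pattern:7} -> {apply_pattern('ktb', pattern):8} {gloss}")
# k-t-b + 1i2a3   -> kitab    noun ('book')
# k-t-b + ma12a3  -> maktab   noun of place ('office')
# k-t-b + 1a2a3a  -> kataba   perfective verb, 3sg.m ('he wrote')
```

The point of the sketch is that the morphemes are discontinuous: no contiguous substring of kitab or maktab corresponds to the root k-t-b, which is exactly what makes these words hard for substring-based tokenizers to segment cleanly.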
Their key finding is counterintuitive: the morphological fidelity of a model's tokenizer (how well it splits words into the correct morphemes) does not reliably predict the model's ability to generate correct Arabic morphology. Good alignment is neither necessary for the LLM to perform well nor sufficient to guarantee that it does. The paper, accepted at the LREC 2026 conference, challenges a core assumption in NLP for languages like Arabic, Hebrew, and Amharic, suggesting that the internal representations LLMs learn may be more abstract and flexible than previously thought, decoupling tokenization strategy from final linguistic capability.
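One common way to quantify the tokenizer side of this question is to compare a tokenizer's split points against a gold morpheme segmentation. The boundary-F1 metric and the wa-kataba example below are illustrative assumptions, not necessarily the study's exact protocol:

```python
# Sketch of scoring a tokenizer's morphological alignment: compare the
# character offsets where the tokenizer splits a word against the
# boundaries in a gold morpheme segmentation, then report F1.

def boundaries(segments: list[str]) -> set[int]:
    """Character offsets of the internal boundaries in a segmentation."""
    cuts, pos = set(), 0
    for seg in segments[:-1]:
        pos += len(seg)
        cuts.add(pos)
    return cuts

def boundary_f1(pred_segs: list[str], gold_segs: list[str]) -> float:
    pred, gold = boundaries(pred_segs), boundaries(gold_segs)
    if not pred and not gold:
        return 1.0  # both leave the word whole: perfect agreement
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: gold segmentation of wa-kataba 'and he wrote'
# vs. a BPE-style split whose cuts miss the morpheme boundary entirely.
gold = ["wa", "kataba"]
bpe_split = ["wak", "ata", "ba"]
print(boundary_f1(bpe_split, gold))  # 0.0: no split lands on a morpheme edge
```

Under the paper's finding, a model whose tokenizer scores poorly on a metric like this can still generate correct morphology, and a high-scoring tokenizer is no guarantee of it.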
- Evaluated seven Arabic-centric and multilingual LLMs (e.g., GPT-4, Claude) on a new Arabic root-pattern generation test set.
- Found that a tokenizer's morphological alignment is neither necessary nor sufficient for correct morphological generation by the LLM.
- Challenges the assumed direct link between tokenizer design and downstream performance in morphologically complex languages.
Why It Matters
The finding simplifies LLM development for Arabic and similar languages: engineers can prioritize other architectural choices rather than treating morphologically faithful tokenization as a prerequisite.