KazakhOCR: A Synthetic Benchmark for Evaluating Multimodal Models in Low-Resource Kazakh Script OCR
New 7,219-image dataset reveals top AI models fail completely on Kazakh Arabic and Latin scripts.
A team of researchers has published the KazakhOCR benchmark, a crucial new tool for evaluating how well multimodal large language models (MLLMs) handle low-resource languages. The benchmark consists of 7,219 synthetically generated images covering all three scripts used to write Kazakh—Arabic, Cyrillic, and Latin—with variations in font, color, and noise to simulate real-world optical character recognition (OCR) conditions. This fills a significant gap: no OCR benchmarks or image datasets previously existed for the Arabic and Latin scripts of Kazakh, a Turkic language with unique orthographic challenges.
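The paper's generation pipeline is not reproduced here, but the described recipe—render text in a varied font and color, then inject noise—can be sketched with Pillow. The function name, dimensions, and noise level below are illustrative assumptions, not the authors' actual parameters:

```python
import random
from PIL import Image, ImageDraw, ImageFont

def render_sample(text, size=(320, 64), fg=(20, 20, 20),
                  bg=(245, 245, 235), noise=0.02):
    """Render a text string as a noisy image, loosely mirroring the
    benchmark's font/color/noise variations (illustrative only)."""
    img = Image.new("RGB", size, bg)
    draw = ImageDraw.Draw(img)
    # A real pipeline would cycle through script-appropriate TTF fonts;
    # the default bitmap font stands in here.
    font = ImageFont.load_default()
    draw.text((10, size[1] // 3), text, fill=fg, font=font)
    # Salt-and-pepper noise on a small fraction of pixels.
    px = img.load()
    for _ in range(int(size[0] * size[1] * noise)):
        x, y = random.randrange(size[0]), random.randrange(size[1])
        px[x, y] = random.choice([(0, 0, 0), (255, 255, 255)])
    return img

sample = render_sample("qazaq tili")  # Latin-script Kazakh
```

A full dataset would repeat this over many strings, fonts, and color/noise settings per script.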
When testing three leading MLLMs—Gemma-3-12B-it, Qwen2.5-VL-7B-Instruct, and Llama-3.2-11B-Vision-Instruct—the results were stark. All models were completely unsuccessful at OCR for the Latin and Arabic scripts. Furthermore, they failed to even recognize the Arabic script as Kazakh text, frequently misclassifying it as Arabic, Farsi, or Kurdish. This indicates a fundamental lack of training data and capability for these specific, complex scripts.
The researchers also compared the MLLMs against a classical OCR baseline, and the traditional systems achieved lower character error rates; the modern MLLMs could not match even that baseline. The findings, accepted for presentation at AbjadNLP @ EACL 2026, demonstrate a significant and previously unmeasured shortcoming in current AI capabilities. The paper argues this exposes a critical need for more inclusive model development and specialized benchmarks to support the world's many low-resource scripts and languages, which are often overlooked in favor of high-resource languages like English.
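The comparison metric mentioned above, character error rate (CER), is the Levenshtein edit distance between a model's transcription and the reference text, divided by the reference length. A minimal stdlib-only implementation (the Kazakh example string is illustrative, not from the dataset):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance (insertions, deletions,
    substitutions) divided by the reference length."""
    m, n = len(reference), len(hypothesis)
    # prev[j] = edit distance between reference[:i-1] and hypothesis[:j]
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # deletion
                          curr[j - 1] + 1,    # insertion
                          prev[j - 1] + cost) # substitution
        prev = curr
    return prev[n] / m if m else 0.0

# Cyrillic Kazakh with қ misread as plain к in two places:
print(cer("қазақ", "казак"))  # → 0.4
```

Lower is better; a CER of 1.0 or above means the output is no closer to the reference than an empty string.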
- New KazakhOCR benchmark contains 7,219 synthetic images across Kazakh's Arabic, Cyrillic, and Latin scripts.
- Tested models (Gemma-3, Qwen2.5-VL, Llama-3.2-Vision) showed 0% success on Latin/Arabic script OCR and misclassified the language.
- MLLMs underperformed compared to traditional OCR methods, revealing a major gap in low-resource script support.
Why It Matters
Exposes critical blind spots in AI's global accessibility, showing top models fail on scripts used by millions.