KazakhOCR: A Synthetic Benchmark for Evaluating Multimodal Models in Low-Resource Kazakh Script OCR
New 7,219-image dataset reveals top AI models fail completely on Kazakh Arabic and Latin scripts.
A team of researchers has published the KazakhOCR benchmark, a crucial new tool for evaluating how well multimodal large language models (MLLMs) handle low-resource languages. The benchmark consists of 7,219 synthetically generated images covering all three scripts used to write Kazakh—Arabic, Cyrillic, and Latin—with variations in font, color, and noise to simulate real-world optical character recognition (OCR) conditions. This fills a significant gap: no OCR benchmarks or image datasets previously existed for the Arabic and Latin scripts of Kazakh, a Turkic language with unique orthographic challenges.
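The paper's generation pipeline is not reproduced here, but the described recipe—render text in a varied font and color, then inject noise—can be sketched with Pillow. The function name, dimensions, and noise level below are illustrative assumptions, not the authors' actual parameters:

```python
import random
from PIL import Image, ImageDraw, ImageFont

def render_sample(text, size=(320, 64), fg=(20, 20, 20),
                  bg=(245, 245, 235), noise=0.02):
    """Render a text string as a noisy image, loosely mirroring the
    benchmark's font/color/noise variations (illustrative only)."""
    img = Image.new("RGB", size, bg)
    draw = ImageDraw.Draw(img)
    # A real pipeline would cycle through script-appropriate TTF fonts;
    # the default bitmap font stands in here.
    font = ImageFont.load_default()
    draw.text((10, size[1] // 3), text, fill=fg, font=font)
    # Salt-and-pepper noise on a small fraction of pixels.
    px = img.load()
    for _ in range(int(size[0] * size[1] * noise)):
        x, y = random.randrange(size[0]), random.randrange(size[1])
        px[x, y] = random.choice([(0, 0, 0), (255, 255, 255)])
    return img

sample = render_sample("qazaq tili")  # Latin-script Kazakh
```

A full dataset would repeat this over many strings, fonts, and color/noise settings per script.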
When testing three leading MLLMs—Gemma-3-12B-it, Qwen2.5-VL-7B-Instruct, and Llama-3.2-11B-Vision-Instruct—the results were stark. All models were completely unsuccessful at OCR for the Latin and Arabic scripts. Furthermore, they failed to even recognize the Arabic script as Kazakh text, frequently misclassifying it as Arabic, Farsi, or Kurdish. This indicates a fundamental lack of training data and capability for these specific, complex scripts.
The researchers also compared the MLLMs against a classical OCR baseline, and the traditional systems achieved lower character error rates; the modern MLLMs could not match even that baseline. The findings, accepted for presentation at AbjadNLP @ EACL 2026, demonstrate a significant and previously unmeasured shortcoming in current AI capabilities. The paper argues this exposes a critical need for more inclusive model development and specialized benchmarks to support the world's many low-resource scripts and languages, which are often overlooked in favor of high-resource languages like English.
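The comparison metric mentioned above, character error rate (CER), is the Levenshtein edit distance between a model's transcription and the reference text, divided by the reference length. A minimal stdlib-only implementation (the Kazakh example string is illustrative, not from the dataset):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance (insertions, deletions,
    substitutions) divided by the reference length."""
    m, n = len(reference), len(hypothesis)
    # prev[j] = edit distance between reference[:i-1] and hypothesis[:j]
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # deletion
                          curr[j - 1] + 1,    # insertion
                          prev[j - 1] + cost) # substitution
        prev = curr
    return prev[n] / m if m else 0.0

# Cyrillic Kazakh with қ misread as plain к in two places:
print(cer("қазақ", "казак"))  # → 0.4
```

Lower is better; a CER of 1.0 or above means the output is no closer to the reference than an empty string.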
- New KazakhOCR benchmark contains 7,219 synthetic images across Kazakh's Arabic, Cyrillic, and Latin scripts.
- Tested models (Gemma-3, Qwen2.5-VL, Llama-3.2-Vision) showed 0% success on Latin/Arabic script OCR and misclassified the language.
- MLLMs underperformed compared to traditional OCR methods, revealing a major gap in low-resource script support.
Why It Matters
Exposes critical blind spots in AI's global accessibility, showing top models fail on scripts used by millions.