Research & Papers

TajPersLexon bridges Tajik-Persian gap with 40K lexical pairs

Hybrid model hits 96.4% accuracy on OCR post-correction in a low-resource, cross-script setting

Deep Dive

This work introduces TajPersLexon, a curated Tajik-Persian parallel lexical resource of 40,112 word and short-phrase pairs designed for cross-script lexical retrieval, transliteration, and alignment in low-resource settings. The authors run a CPU-only benchmark comparing three methodological families: a lightweight hybrid pipeline, neural sequence-to-sequence models, and retrieval methods. The evaluation shows that the task is close to solved, with neural and retrieval baselines reaching 98-99% top-1 accuracy. Large multilingual sentence transformers, by contrast, fail at this kind of exact lexical matching, while the interpretable hybrid model offers a favorable accuracy-efficiency trade-off for practical applications, reaching 96.4% accuracy on an OCR post-correction task. All experiments use fixed random seeds for reproducibility, and the authors state that the dataset, code, and models will be publicly released.
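To make the retrieval setting and the top-1 metric concrete, here is a minimal sketch of cross-script lexical retrieval. It is not the authors' pipeline: the character map `CYR_TO_ARABIC`, the helper names, and the five-word toy lexicon are all illustrative assumptions, and the map covers only the letters needed for these examples.

```python
import unicodedata
from collections import Counter

# Toy Tajik Cyrillic -> Perso-Arabic map (hypothetical, deliberately incomplete).
# Short vowels are usually unwritten in Persian script, so they map to "".
CYR_TO_ARABIC = {
    "к": "ک", "т": "ت", "б": "ب", "н": "ن", "д": "د",
    "р": "ر", "м": "م", "с": "س",
    "о": "ا", "ӯ": "و",
    "а": "", "и": "", "у": "",
}

def transliterate(word):
    """Map a Cyrillic word to Perso-Arabic script character by character."""
    normalized = unicodedata.normalize("NFC", word.lower())
    return "".join(CYR_TO_ARABIC.get(ch, ch) for ch in normalized)

def char_ngrams(word, n=2):
    """Multiset of character n-grams, with boundary markers."""
    padded = f"^{word}$"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def dice(a, b):
    """Dice overlap between two n-gram multisets."""
    overlap = sum((a & b).values())
    total = sum(a.values()) + sum(b.values())
    return 2 * overlap / total if total else 0.0

def retrieve(query, candidates):
    """Return the candidate whose n-grams best match the transliterated query."""
    q = char_ngrams(transliterate(query))
    return max(candidates, key=lambda c: dice(q, char_ngrams(c)))

def top1_accuracy(pairs, candidates):
    """Fraction of source words whose best-ranked candidate is the gold target."""
    hits = sum(retrieve(src, candidates) == tgt for src, tgt in pairs)
    return hits / len(pairs)

# Toy gold pairs: (Tajik Cyrillic, Persian in Perso-Arabic script).
PAIRS = [
    ("китоб", "کتاب"),   # book
    ("нон", "نان"),      # bread
    ("дард", "درد"),     # pain
    ("мактаб", "مکتب"),  # school
    ("дӯст", "دوست"),    # friend
]
CANDIDATES = [tgt for _, tgt in PAIRS]

print(top1_accuracy(PAIRS, CANDIDATES))
```

On this toy set every query retrieves its gold form, which says nothing about real-world difficulty: a character map this simple breaks down quickly on a full lexicon, which is where learned or hybrid approaches come in.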

The work is published in the Proceedings of the First Workshop on NLP and LLMs for the Iranian Language Family (SilkRoadNLP 2026). By providing high-quality parallel data and a model that balances accuracy and computational efficiency, TajPersLexon addresses a critical gap in low-resource NLP for Persian and Tajik, which share much of their vocabulary but are written in different scripts (Perso-Arabic and Cyrillic, respectively). The resource supports cross-script information retrieval, machine translation, and digital preservation for over 100 million speakers. The hybrid model's strong performance on OCR post-correction is particularly valuable for digitizing historical texts and processing noisy real-world data. For practitioners, this means a lightweight, reproducible pipeline that runs on standard CPUs without expensive GPU infrastructure.
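As a rough illustration of what lexicon-backed OCR post-correction can look like on a CPU, the sketch below snaps each noisy token to its closest lexicon entry using Python's standard `difflib`. This is an assumed baseline for illustration only, not the paper's hybrid model; `LEXICON`, `NOISY`, `post_correct`, and the 0.6 cutoff are hypothetical choices.

```python
import difflib

# Tiny stand-in lexicon (hypothetical); a real one would hold the full word list.
LEXICON = ["китоб", "мактаб", "дӯст", "нон", "дард"]

def post_correct(token, lexicon, cutoff=0.6):
    """Replace an OCR token with its closest lexicon entry, if one is close enough."""
    if token in lexicon:
        return token
    matches = difflib.get_close_matches(token, lexicon, n=1, cutoff=cutoff)
    # Leave the token unchanged when nothing in the lexicon is similar enough.
    return matches[0] if matches else token

def correction_accuracy(noisy_gold, lexicon):
    """Fraction of noisy tokens that are restored to their gold form."""
    hits = sum(post_correct(noisy, lexicon) == gold for noisy, gold in noisy_gold)
    return hits / len(noisy_gold)

# (noisy OCR output, gold form): simulated single-character recognition errors.
NOISY = [("китсб", "китоб"), ("мактао", "мактаб"), ("нон", "нон")]

print(correction_accuracy(NOISY, LEXICON))
```

The cutoff trades recall for safety: a lower value corrects more aggressively but risks rewriting valid out-of-lexicon words, so it would need tuning on held-out data.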

Key Points
  • Resource contains 40,112 Tajik-Persian word and short-phrase pairs across Arabic and Cyrillic scripts
  • Neural and retrieval models achieve 98-99% top-1 accuracy on lexical matching tasks
  • Hybrid model reaches 96.4% accuracy on OCR post-correction while running on CPU only

Why It Matters

Enables cross-script NLP for Tajik-Persian, preserving language heritage and powering practical applications without expensive GPUs.