Research & Papers

Soro: A Lightweight Tajik LLM Outperforms Gemma 3 with 1.9B Token Training

Built from Gemma 3 and continually pretrained on 1.9B Tajik tokens for edge deployment in schools.

Deep Dive

A team led by Stanislav Liashkov and Bonu Boboeva introduced Soro, a family of lightweight conversational LLMs tailored to the Tajik language and to real-world deployment under tight compute constraints. Starting from open-weight Gemma 3 checkpoints, they ran Tajik-only continual pretraining on a curated 1.9-billion-token corpus of filtered web text, PDF documents, and curriculum-aligned educational materials. To address the lack of Tajik evaluation data, they also released a suite of open-source benchmarks covering general knowledge, linguistic competence, and school and university entrance exams.

Across all Tajik benchmarks, Soro substantially outperforms same-size Gemma 3 baselines while retaining strong performance on standard English datasets. The team further demonstrated that FP8 and INT4 quantization preserves most of the Tajik-language gains while reducing memory requirements for edge deployment. The work supports an ongoing education-sector pilot in Tajikistan, and the team plans to scale AI-assisted learning across schools, offering a practical pathway for low-resource language models in constrained environments.
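To make the memory argument concrete, here is a rough back-of-the-envelope sketch of how weight storage shrinks under FP8 and INT4. The parameter counts are hypothetical placeholders (the summary does not state which Gemma 3 sizes Soro uses), the BF16 baseline is an assumption, and the estimate ignores activations, the KV cache, and the small overhead of quantization scales.

```python
# Approximate bytes needed to store one weight in each format.
# BF16 is assumed as the unquantized baseline; this is not stated in the source.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_memory_gib(num_params: float, fmt: str) -> float:
    """Estimate weight-only memory in GiB for a model with num_params parameters."""
    return num_params * BYTES_PER_PARAM[fmt] / 2**30


if __name__ == "__main__":
    # Hypothetical model sizes, chosen only for illustration.
    for num_params in (1e9, 4e9):
        for fmt in BYTES_PER_PARAM:
            gib = weight_memory_gib(num_params, fmt)
            print(f"{num_params / 1e9:.0f}B params, {fmt}: ~{gib:.2f} GiB")
```

Under these assumptions, FP8 roughly halves weight memory relative to BF16 and INT4 roughly quarters it, which is the kind of reduction that matters on low-memory school hardware.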

Key Points
  • Built from Gemma 3 with continual pretraining on a 1.9B-token Tajik corpus and 40K supervised instruction examples.
  • Introduces open-source Tajik benchmarks (general knowledge, linguistic competence, entrance exams) for rigorous evaluation.
  • FP8/INT4 quantization preserves most of the Tajik gains while reducing memory, enabling edge deployment in Tajikistani schools.

Why It Matters

Supports AI-assisted education in Tajikistan with a lightweight, quantizable model that outperforms same-size Gemma 3 baselines on Tajik benchmarks.