Soro: A Lightweight Tajik LLM Outperforms Gemma 3 with 1.9B-Token Training
Built from Gemma 3 and continually pretrained on 1.9B Tajik tokens for edge deployment in schools.
A team led by Stanislav Liashkov and Bonu Boboeva introduced Soro, a family of lightweight conversational LLMs tailored to the Tajik language and to real-world deployment under tight compute constraints. Starting from open-weight Gemma 3 checkpoints, they conducted Tajik-only continual pretraining on a curated 1.9-billion-token corpus that includes filtered web text, PDF documents, and curriculum-aligned educational materials. To address the lack of Tajik evaluation data, they also released a suite of open-source benchmarks covering general knowledge, linguistic competence, and school and university entrance exams.
Across all Tajik benchmarks, Soro substantially outperforms same-size Gemma 3 baselines while retaining strong performance on standard English datasets. The team further demonstrated that FP8 and INT4 quantization preserve most of the Tajik-language gains while reducing memory requirements for edge deployment. The work supports an ongoing education-sector pilot in Tajikistan, with plans to scale AI-assisted learning across schools, and it shows a practical pathway for low-resource language models in constrained environments.
- Built from Gemma 3 with continual pretraining on a 1.9B-token Tajik corpus and 40K supervised instruction examples.
- Introduces open-source Tajik benchmarks (general knowledge, linguistic competence, entrance exams) for rigorous evaluation.
- FP8/INT4 quantization preserves Tajik gains while reducing memory, enabling edge deployment in Tajikistani schools.
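
To make the memory argument for quantization concrete, here is a minimal back-of-the-envelope sketch in Python. It estimates weight storage alone at 16-bit, FP8, and INT4 precision. The parameter count is a hypothetical placeholder, not a figure from the Soro release, and the estimate ignores quantization scales, activations, and the KV cache, so real footprints will be somewhat larger.

```python
# Rough weight-only memory estimate at different numeric precisions.
# Assumption: every weight is stored at the stated bit width; per-group
# scales/zero-points, activations, and KV cache are NOT counted.

BITS_PER_WEIGHT = {"bf16": 16, "fp8": 8, "int4": 4}


def weight_memory_gib(num_params: int, fmt: str) -> float:
    """Return the approximate size of the model weights in GiB."""
    bits = BITS_PER_WEIGHT[fmt]
    return num_params * bits / 8 / 2**30


if __name__ == "__main__":
    # Hypothetical parameter count chosen for illustration only.
    hypothetical_params = 1_000_000_000
    for fmt in BITS_PER_WEIGHT:
        size = weight_memory_gib(hypothetical_params, fmt)
        print(f"{fmt}: {size:.2f} GiB")
```

Under these assumptions, FP8 halves the weight footprint relative to 16-bit storage and INT4 quarters it, which is what makes small devices plausible deployment targets.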
Why It Matters
Soro supports AI-assisted education in Tajikistan with a lightweight, quantized model that outperforms same-size Gemma 3 baselines on Tajik benchmarks.