AFRILANGTUTOR: Advancing Language Tutoring and Culture Education in Low-Resource Languages with Large Language Models
A new dataset of 194.7K dictionary entries powers AI tutors for languages with scarce resources.
A team of 7 researchers from multiple institutions has developed AfriLangTutor, a novel AI language-tutoring system designed for low-resource African languages. To overcome the lack of training data, they first created AfriLangDict, a curated collection of 194.7K African-language-to-English dictionary entries. This seed resource was then used to automatically generate AfriLangEdu, a dataset of 78.9K multi-turn student-tutor question-answer interactions suitable for supervised fine-tuning (SFT) and direct preference optimization (DPO).
The team fine-tuned two multilingual LLMs, Llama-3-8B-IT and Gemma-3-12B-IT, on AfriLangEdu across 10 African languages. Models trained on the dataset consistently outperformed their base counterparts, with combined SFT and DPO yielding gains of 1.8% to 15.5% under LLM-as-a-judge evaluations across four criteria. All resources are open-sourced to accelerate research into AI-powered language education for underserved communities.
- AfriLangDict contains 194.7K dictionary entries as seed data for generating language-learning materials
- AfriLangEdu includes 78.9K multi-turn training examples for SFT and DPO across 10 African languages
- Combined SFT+DPO training of Llama-3-8B-IT and Gemma-3-12B-IT improved judged performance by 1.8% to 15.5% over the base models
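To make the data pipeline concrete, here is a minimal sketch of how a dictionary entry could be turned into an SFT conversation and a DPO preference pair. The entry schema, the sample Swahili entry, and the distractor-gloss strategy are illustrative assumptions, not the authors' actual generation method or the released datasets' format:

```python
# Sketch: building SFT and DPO training examples from a dictionary entry.
# Entry fields, sample data, and the "rejected" construction are hypothetical.

def make_sft_example(entry):
    """Build a multi-turn student-tutor conversation for supervised fine-tuning."""
    return {
        "messages": [
            {"role": "user",
             "content": f"What does '{entry['word']}' mean in {entry['language']}?"},
            {"role": "assistant",
             "content": f"'{entry['word']}' means '{entry['gloss']}' in English."},
            {"role": "user", "content": "Can you use it in a sentence?"},
            {"role": "assistant", "content": entry["example"]},
        ]
    }

def make_dpo_pair(entry, rejected_gloss):
    """Build a (prompt, chosen, rejected) triple for direct preference optimization."""
    prompt = f"What does '{entry['word']}' mean in {entry['language']}?"
    return {
        "prompt": prompt,
        "chosen": f"'{entry['word']}' means '{entry['gloss']}' in English.",
        "rejected": f"'{entry['word']}' means '{rejected_gloss}' in English.",
    }

# Hypothetical Swahili entry for illustration.
entry = {
    "language": "Swahili",
    "word": "karibu",
    "gloss": "welcome",
    "example": "Karibu nyumbani! (Welcome home!)",
}

sft = make_sft_example(entry)
dpo = make_dpo_pair(entry, rejected_gloss="far")  # deliberately wrong gloss

print(len(sft["messages"]))  # number of conversation turns
print(dpo["chosen"])
```

Examples in this shape plug directly into standard SFT and preference-optimization trainers (e.g. the chat-`messages` and `prompt`/`chosen`/`rejected` conventions used by common fine-tuning libraries).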
Why It Matters
Democratizes AI language tutoring for 10 African languages, enabling scalable education in underserved communities.