Nsanku: Evaluating Zero-Shot Translation Performance of LLMs for Ghanaian Languages
No LLM achieved both high performance and consistency across 43 Ghanaian languages, study finds
A team of 16 researchers from Ghanaian institutions published Nsanku, the most comprehensive benchmark to date for zero-shot machine translation of Ghanaian languages using large language models. The study evaluated 19 open-weight and proprietary LLMs – including Gemini 2.5 Flash, Claude Sonnet 4.5, GPT-4.1, and Kimi K2 Instruct – across 43 Ghanaian languages paired with English. Evaluation sentences (300 per language) were sourced from the YouVersion Bible platform, ensuring consistent parallel data. Two automatic metrics were employed: BLEU (n-gram precision) and chrF (character-level F-score), plus a cross-language consistency score.
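The chrF metric used in the study can be illustrated in a few lines. The sketch below is a simplified, self-contained approximation (the reference implementation in sacreBLEU additionally handles word n-grams, whitespace treatment options, and effective-order edge cases); `max_n=6` and `beta=2` mirror the usual chrF defaults.

```python
from collections import Counter

def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams, ignoring spaces (as chrF does by default)."""
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF: average character n-gram precision and recall over
    orders 1..max_n, then combine into an F-score weighted toward recall."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        if sum(hyp.values()):
            precisions.append(overlap / sum(hyp.values()))
        if sum(ref.values()):
            recalls.append(overlap / sum(ref.values()))
    p = sum(precisions) / len(precisions) if precisions else 0.0
    r = sum(recalls) / len(recalls) if recalls else 0.0
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

Because it operates on character rather than word n-grams, chrF is more forgiving of morphological variation than BLEU, which is one reason studies on morphologically rich, low-resource languages report both.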
Results show Gemini 2.5 Flash leading with an average score of 26.88 (BLEU 24.60, chrF 29.16), followed by Claude Sonnet 4.5 at 24.87 and GPT-4.1 at 23.20. The best open-weight model, Kimi K2 Instruct, scored 20.87. However, a critical finding is that no model – and no individual language – simultaneously achieved high performance and high consistency. This indicates that even top LLMs become unreliable when scaled across Ghana's linguistic diversity. Per-language averages ranged from Siwu (25.73, best) to Nkonya (11.65, worst). Nsanku is publicly released as an extensible infrastructure for African language NLP research.
- Gemini 2.5 Flash tops with avg 26.88 (BLEU 24.60, chrF 29.16) on 43 Ghanaian languages
- No model reached the high-performance, high-consistency quadrant – LLMs remain unreliable for scalable translation
- Siwu performed best (25.73), Nkonya worst (11.65) among 43 languages tested
Why It Matters
Highlights critical gaps in LLM multilingual capabilities for low-resource African languages, guiding future NLP research and model development.