arXiv survey on Hausa and Fongbe NLP resources exposes key gaps
Hausa's 80-100 million speakers have rich text data, but dedicated speech datasets are scarce.
A survey published on arXiv (2605.22828) by Mahounan Pericles Adjovi, Victor Olufemi, Roald Eiselen, and Prasenjit Mitra maps the state of NLP resources for two West African languages: Hausa (spoken by 80-100 million people) and Fongbe (~2 million speakers in Benin). The authors systematically searched academic repositories, data platforms, and web sources to catalog parallel corpora, monolingual text collections, speech datasets, pre-trained models, and evaluation benchmarks. For each resource they documented size, domain coverage, format, licensing, and accessibility. The survey reveals a stark contrast: Hausa has diverse text resources across news, encyclopedic, and educational domains, while Fongbe has far fewer text resources but has recently been the focus of academic speech data collection efforts.
The findings show that both languages are already represented in Masakhane benchmarks for named entity recognition and part-of-speech tagging, which indicates some existing community support. However, critical gaps remain. Fongbe lacks domain-diverse text corpora (most existing data is limited in genre), and Hausa lacks the dedicated speech corpora needed for robust speech technology development. The authors provide task-specific recommendations for future data collection, such as prioritizing Fongbe text in new domains and expanding Hausa spoken language datasets. For researchers and funders working on low-resource languages, particularly in West Africa, the survey serves as a roadmap for where new data collection is most needed.
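The cataloging approach described above (one record per resource, with size, domain coverage, format, licensing, and accessibility) can be pictured as a small data structure plus a gap check. The following is a minimal sketch under stated assumptions, not the authors' tooling: the `Resource` class, its field names, the `KINDS` tuple, and the `coverage` and `gaps` helpers are all hypothetical, and the sample entries are invented placeholders rather than real datasets from the survey.

```python
from collections import defaultdict
from dataclasses import dataclass

# Resource categories named in the survey summary (labels are assumptions).
KINDS = ("parallel", "monolingual", "speech", "model", "benchmark")


@dataclass(frozen=True)
class Resource:
    """One catalog entry; field names are hypothetical."""
    name: str
    language: str      # e.g. "hausa" or "fongbe"
    kind: str          # one of KINDS
    size: str          # free text, e.g. "unknown"
    domain: str        # e.g. "news", "encyclopedic", "educational"
    fmt: str           # e.g. "txt", "wav"
    license: str       # e.g. "unknown"
    accessible: bool   # whether the resource can actually be downloaded


def coverage(resources):
    """Count catalog entries per language and resource kind."""
    counts = defaultdict(lambda: {kind: 0 for kind in KINDS})
    for res in resources:
        if res.kind not in KINDS:
            raise ValueError(f"unknown resource kind: {res.kind!r}")
        counts[res.language][res.kind] += 1
    return {language: dict(by_kind) for language, by_kind in counts.items()}


def gaps(resources, languages):
    """List, for each language, the kinds that have no catalog entry."""
    cov = coverage(resources)
    return {
        language: [k for k in KINDS if cov.get(language, {}).get(k, 0) == 0]
        for language in languages
    }


# Invented placeholder entries, for illustration only.
sample = [
    Resource("placeholder-hausa-text", "hausa", "monolingual",
             "unknown", "news", "txt", "unknown", True),
    Resource("placeholder-fongbe-speech", "fongbe", "speech",
             "unknown", "read speech", "wav", "unknown", True),
]

if __name__ == "__main__":
    for language, missing in gaps(sample, ["hausa", "fongbe"]).items():
        print(language, "has no entries for:", ", ".join(missing))
```

With a real catalog in place of the placeholders, the same check could be extended to group by domain, which is the dimension along which the survey reports Fongbe text as lacking.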
- Hausa (80-100M speakers) has broad text resources spanning news, encyclopedic, and educational domains.
- Fongbe (~2M speakers) has limited text but has seen recent, focused academic speech data collection.
- Both languages are included in Masakhane benchmarks for NER and POS tagging, but gaps remain in domain-diverse Fongbe text and dedicated Hausa speech corpora.
Why It Matters
This survey gives researchers and funders a roadmap for closing NLP resource gaps that affect more than 80 million speakers of Hausa and Fongbe in West Africa.