Multilingual Retrieval Fails for Amharic: 23% Gap Behind Monolingual Models

arXiv cs.IR May 26, 2026

⚡ The strongest zero-shot multilingual retriever trails the best monolingual Amharic model by 23% relative MRR@10.

Deep Dive

A new study accepted at the MeLLM workshop (ACL 2026) reports a substantial performance gap in multilingual retrieval for an underrepresented language. The authors, Yosef Worku Alemneh, Kidist Amde Mekonnen, and Maarten de Rijke, focus on Amharic, a morphologically rich Semitic language spoken by over 30 million people. Using a shared passage retrieval protocol across dense, late-interaction, learned sparse, and cross-encoder paradigms, they compare three groups of systems: zero-shot multilingual retrievers, multilingual models fine-tuned on Amharic, and fully monolingual Amharic retrievers. Results are reported in MRR@10, the mean reciprocal rank of the first relevant passage within the top 10 results. The strongest zero-shot multilingual retriever trailed the best monolingual Amharic first-stage retriever by 23% relative MRR@10. Fine-tuning two recent multilingual embedding models on the same Amharic supervision yielded 32-60% relative MRR@10 gains over their zero-shot performance, but the best fine-tuned model still scored below the top monolingual Amharic retriever.
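
For readers unfamiliar with the metric, the following is a minimal Python sketch of how MRR@10 and a relative difference between two scores can be computed. It is not the authors' evaluation code: the function names `mrr_at_k` and `relative_change` and the toy rankings are hypothetical, and the summary does not state which score serves as the baseline for the "relative" figures, so the sketch simply assumes the reference score is the denominator.

```python
from typing import Dict, List, Set


def mrr_at_k(rankings: Dict[str, List[str]],
             relevant: Dict[str, Set[str]],
             k: int = 10) -> float:
    """Mean reciprocal rank of the first relevant passage within the top k."""
    total = 0.0
    for qid, ranked_ids in rankings.items():
        for rank, pid in enumerate(ranked_ids[:k], start=1):
            if pid in relevant.get(qid, set()):
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(rankings) if rankings else 0.0


def relative_change(score: float, reference: float) -> float:
    """Signed change of `score` relative to `reference`, as a fraction.

    Assumption: `reference` is the denominator; the article does not say
    which system's score the paper uses as the baseline.
    """
    return (score - reference) / reference


# Toy data, invented for illustration only (not taken from the paper).
rankings = {
    "q1": ["p3", "p1", "p7"],  # first relevant passage at rank 2
    "q2": ["p4", "p9", "p2"],  # first relevant passage at rank 1
    "q3": ["p5", "p6", "p8"],  # no relevant passage retrieved
}
relevant = {"q1": {"p1"}, "q2": {"p4"}, "q3": {"p0"}}

score = mrr_at_k(rankings, relevant)  # (1/2 + 1 + 0) / 3 = 0.5
print(f"MRR@10 = {score:.3f}")
```

Under this assumed convention, a system scoring 23% below a reference gives `relative_change(system, reference)` of about -0.23, and a fine-tuned model that improves on its zero-shot score by 32% gives about +0.32.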

These findings bear directly on equitable access to information in the LLM era. The authors argue that zero-shot multilingual retrieval is not a sufficient proxy for performance on underrepresented languages, and that retrieval must be evaluated and adapted in-language rather than inferred from aggregate multilingual benchmarks. For developers of retrieval-augmented generation (RAG) systems and cross-lingual QA pipelines, the message is that strong average scores on benchmarks like MIRACL or XOR-TyDi can mask severe disparities for individual languages. To support further research, the team publicly released the Amharic retrieval dataset, codebase, and trained models so that others can replicate and extend the results.

Key Points

Strongest zero-shot multilingual retriever underperforms monolingual Amharic retriever by 23% relative MRR@10.
Fine-tuning multilingual models on Amharic yields 32-60% relative MRR@10 gains over zero-shot but still falls short of the best monolingual model.
Researchers release dataset, code, and trained models to promote in-language retrieval evaluation for underrepresented languages.

Why It Matters

Fair AI access requires evaluating retrieval models in each language rather than relying on aggregate multilingual benchmarks.
