Less is More: Adapting Text Embeddings for Low-Resource Languages with Small Scale Noisy Synthetic Data
Fine-tuning on small, noisy synthetic data yields 20%+ retrieval gains, matching models trained on 1M examples.
A research team led by Zaruhi Navasardyan has published a paper challenging the need for massive datasets to create effective text embeddings for low-resource languages (LRLs). Their work, "Less is More: Adapting Text Embeddings for Low-Resource Languages with Small Scale Noisy Synthetic Data," demonstrates that fine-tuning an existing multilingual encoder, mE5, on a surprisingly small amount of noisy synthetic data can yield dramatic performance gains. Focusing on Armenian as a test case, they generated just 10,000 training pairs by machine-translating English Reddit title-body posts with open-weights models, deliberately avoiding costly, human-verified translations.
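The article does not spell out the training recipe, but a pipeline of this kind can be sketched with the sentence-transformers library. Everything below is an illustrative assumption rather than the authors' released configuration: the checkpoint size (intfloat/multilingual-e5-base), the contrastive in-batch-negatives loss, and all hyperparameters. Each machine-translated Reddit title is treated as a query and its body as the matching passage.

```python
# Illustrative sketch (not the authors' code): adapt mE5 to a low-resource
# language using ~10,000 noisy, machine-translated (title, body) pairs.
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Placeholder data; in practice ~10,000 pairs obtained by machine-translating
# English Reddit posts into the target low-resource language.
pairs = [
    ("translated post title goes here", "translated post body goes here"),
]

# Checkpoint size is an assumption; the E5 family expects "query:" / "passage:" prefixes.
model = SentenceTransformer("intfloat/multilingual-e5-base")

train_examples = [
    InputExample(texts=[f"query: {title}", f"passage: {body}"])
    for title, body in pairs
]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=32)

# Contrastive objective with in-batch negatives: each title is pulled toward
# its own body and pushed away from the other bodies in the same batch.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_loader, train_loss)],
    epochs=1,
    warmup_steps=100,
)
model.save("me5-lrl-synthetic")
```

In-batch negatives are a common choice for (query, passage) style data because they need no explicit negative mining; whether the authors used this exact objective is not stated in the summary above.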
Their comprehensive evaluation revealed a 'Less is More' phenomenon: the minimal, noisy dataset produced an average 11-12% improvement across their benchmark and a relative boost of more than 20% in retrieval performance. Crucially, this matched the performance of models trained on approximately one million examples. The researchers further found that scaling up the data, improving translation quality with state-of-the-art LLMs, and diversifying data domains each failed to provide significant additional gains over the simple baseline. They also validated the approach on another LRL with a unique script, suggesting the findings generalize.
The results indicate that semantic alignment for LRLs saturates early and is highly robust to noise, fundamentally shifting the cost-benefit analysis for embedding adaptation. This breakthrough significantly lowers the barrier to creating high-quality semantic search and RAG (retrieval-augmented generation) systems for languages that lack large-scale, curated datasets. The team has released their model, data, and benchmark publicly to facilitate further research and application by resource-constrained communities.
- Fine-tuning the mE5 encoder on only 10,000 noisy synthetic pairs improved retrieval performance by over 20%.
- The small-scale approach matched the performance of models trained on ~1 million high-quality examples.
- The method proved robust, with neither increased data scale nor improved translation quality yielding significant extra gains.
Why It Matters
This dramatically lowers the cost and complexity of building effective AI search and RAG tools for hundreds of underserved languages.