Research & Papers

WebFAQ 2.0: A Multilingual QA Dataset with Mined Hard Negatives for Dense Retrieval

The new dataset spans 108 languages and includes 1.25M queries with mined hard negatives to train better retrievers.

Deep Dive

Researchers from the University of Passau and others have released WebFAQ 2.0, a multilingual dataset of 198 million question-answer pairs across 108 languages. It includes 14.3 million aligned bilingual pairs and a separate dataset of 1.25 million queries, each with 200 mined hard negatives (answers that look relevant to the query but are not the correct match). Developers can use these negatives to fine-tune dense retrieval models with contrastive learning or knowledge distillation, with the goal of improving multilingual search and QA performance.
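To make the two training objectives concrete, here is a minimal sketch in plain Python. It is not the authors' training code. The function names, the toy two-dimensional embeddings, and the temperature value are assumptions chosen for illustration. The contrastive loss is the standard InfoNCE form (negative log-probability of the positive among the positive plus its hard negatives), and the distillation loss is a KL divergence between a teacher's and a student's score distributions over the same candidates.

```python
import math


def dot(u, v):
    """Dot-product similarity between two embedding vectors."""
    return sum(a * b for a, b in zip(u, v))


def log_softmax(scores):
    """Numerically stable log-softmax over a list of scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def contrastive_loss(query, positive, hard_negatives, temperature=0.05):
    """InfoNCE loss for one query.

    The positive sits at index 0, followed by the hard negatives.
    The temperature of 0.05 is an illustrative assumption.
    """
    scores = [dot(query, positive) / temperature]
    scores += [dot(query, neg) / temperature for neg in hard_negatives]
    return -log_softmax(scores)[0]


def distillation_loss(student_scores, teacher_scores):
    """KL(teacher || student) over one query's candidate list."""
    log_t = log_softmax(teacher_scores)
    log_s = log_softmax(student_scores)
    return sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_t, log_s))


# Toy example: hypothetical 2-d embeddings, two hard negatives.
query = [1.0, 0.0]
positive = [1.0, 0.0]
hard_negatives = [[0.0, 1.0], [0.6, 0.8]]

loss = contrastive_loss(query, positive, hard_negatives)
kd = distillation_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0])
```

In a real setup the embeddings would come from the retriever being fine-tuned, and the teacher scores would typically come from a stronger reranking model. A query with 200 hard negatives simply yields a candidate list of 201 scores.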

Why It Matters

WebFAQ 2.0 gives developers one of the largest public resources for training models on real-world, multilingual question-answering tasks, which can improve the accuracy of search and chatbot systems.