WebFAQ 2.0 releases 198M multilingual Q&A pairs with hard negatives for AI training

arXiv cs.IR February 20, 2026

The new dataset spans 108 languages and includes 1.25M queries with mined hard negatives for training retrievers.

Deep Dive

Researchers from the University of Passau and other institutions have released WebFAQ 2.0, a multilingual dataset of 198 million question-answer pairs across 108 languages. It includes 14.3M aligned bilingual pairs and a separate dataset of 1.25M queries, each paired with 200 mined hard negatives. Developers can use the hard negatives to fine-tune dense retrieval models with contrastive learning or knowledge distillation, with the aim of improving multilingual search and QA systems.
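To make the role of hard negatives concrete, here is a minimal sketch of a contrastive (InfoNCE-style) loss for a single query. It is an illustration under stated assumptions, not the training code used for WebFAQ 2.0: the function names, the dot-product scoring, and the temperature of 0.05 are all hypothetical choices, and the toy two-dimensional vectors stand in for real embeddings.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors (stand-in for embedding similarity)."""
    return sum(a * b for a, b in zip(u, v))


def info_nce_loss(query, positive, negatives, temperature=0.05):
    """Contrastive loss for one query.

    Returns the negative log-probability of the positive answer under a
    softmax over the positive and all negatives. The temperature value is
    an assumption chosen for illustration.
    """
    scores = [dot(query, positive)] + [dot(query, n) for n in negatives]
    logits = [s / temperature for s in scores]
    # Log-sum-exp with the max subtracted for numerical stability.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[0]


# Toy embeddings: the hard negative points almost the same way as the positive.
query = [1.0, 0.0]
positive = [1.0, 0.0]
easy_negative = [0.0, 1.0]
hard_negative = [0.9, 0.436]

loss_easy = info_nce_loss(query, positive, [easy_negative])
loss_hard = info_nce_loss(query, positive, [hard_negative])
```

In this toy setup the hard negative yields a larger loss than the easy one, so it contributes a stronger training signal. That is the usual motivation for mining hard negatives rather than sampling random ones.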

Why It Matters

WebFAQ 2.0 provides a large public resource for training models on real-world, multilingual question-answering data, which could improve the accuracy of search systems and chatbots.
