Research & Papers

WSDM Cup 2026 Multilingual Retrieval: A Low-Cost Multi-Stage Retrieval Pipeline

A four-stage system using Qwen3-Reranker-4B and jina-embeddings-v4 finds needles in 10M-document haystacks.

Deep Dive

Researchers Chentong Hao and Minmao Wang built a low-cost, four-stage multilingual retrieval pipeline for the WSDM Cup 2026. The stages are LLM-based query expansion, BM25 retrieval, dense ranking with jina-embeddings-v4, and final scoring with Qwen3-Reranker-4B. The system takes English queries and searches 10M news articles written in Chinese, Persian, and Russian, reaching a Judged@20 score of 0.95. The work shows how several AI models can be combined effectively under a limited compute budget.
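To make the staged design concrete, here is a minimal sketch of a retrieval cascade, not the authors' implementation. Every function name, the synonym table, the candidate counts, and the sample documents are hypothetical. The expansion, dense, and reranking stages are toy stand-ins for the LLM, jina-embeddings-v4, and Qwen3-Reranker-4B, and the documents are in English only for readability. The Judged@k helper assumes the common IR reading of that metric (the share of the top k results that have any relevance judgment), which the summary itself does not spell out.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def expand_query(query):
    # Hypothetical stand-in for LLM-based query expansion: a fixed synonym table.
    synonyms = {"election": ["vote", "ballot"]}
    terms = tokenize(query)
    return terms + [s for t in terms for s in synonyms.get(t, [])]


def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    # Plain Okapi BM25 over whitespace tokens; returns one score per document.
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avgdl = sum(len(toks) for toks in tokenized) / n
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def dense_score(query, doc):
    # Toy stand-in for an embedding model: cosine similarity of token counts.
    q, d = Counter(tokenize(query)), Counter(tokenize(doc))
    dot = sum(q[t] * d[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(
        sum(v * v for v in d.values())
    )
    return dot / norm if norm else 0.0


def rerank_score(query, doc):
    # Toy stand-in for a cross-encoder reranker: share of query tokens in the doc.
    q_terms = set(tokenize(query))
    return len(q_terms & set(tokenize(doc))) / len(q_terms) if q_terms else 0.0


def cascade(query, docs, k_bm25, k_dense, k_final):
    # Each stage keeps fewer candidates, so costlier scorers see fewer documents.
    lexical = bm25_scores(expand_query(query), docs)
    stage1 = sorted(range(len(docs)), key=lambda i: lexical[i], reverse=True)[:k_bm25]
    stage2 = sorted(
        stage1, key=lambda i: dense_score(query, docs[i]), reverse=True
    )[:k_dense]
    return sorted(
        stage2, key=lambda i: rerank_score(query, docs[i]), reverse=True
    )[:k_final]


def judged_at_k(ranked_ids, judged_ids, k):
    # Fraction of the top-k results that carry any relevance judgment.
    return sum(1 for doc_id in ranked_ids[:k] if doc_id in judged_ids) / k


docs = [
    "the election results were announced today",
    "voters cast a ballot in the capital",
    "the football match ended in a draw",
    "a new vote count was ordered by the court",
]
ranking = cascade("election results", docs, k_bm25=3, k_dense=2, k_final=1)
print(ranking)
```

The cost argument lies in the shrinking candidate counts: the cheap lexical stage scans every document, while the most expensive scorer only sees the few candidates that survive the earlier stages. In a real system the three cutoffs would be tuned against the available compute budget.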

Why It Matters

The pipeline offers a practical blueprint for building high-precision, cost-effective multilingual search systems from open-source AI components.