Research & Papers

MeVer's Cluster-Aware Mining Boosts Scientific Source Retrieval for Fact-Checking

A novel hard-negative mining approach that adapts to multi-stage retrieval pipelines improves multilingual source identification.

Deep Dive

In a new arXiv paper, researchers from MeVer present their submission to CheckThat! 2026 Task 1 on multilingual scientific-source retrieval. The task is to match short, informal, often multilingual social media claims to scientific publications, a setting in which semantically related papers act as distracting negatives during training. The authors propose cluster-aware hard-negative mining strategies that adapt to multi-stage retrieval pipelines. The key insight is that hard-negative mining should be stage-aware: localized cluster negatives improve precision-oriented retrieval, while broader non-gold semantic negatives provide better candidate coverage and more consistent reranking performance across languages.
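The contrast between the two negative pools can be illustrated with a minimal Python sketch. It is not the authors' implementation: the summary does not say how clusters are built or how negatives are scored, so the function names, the precomputed `cluster_of` mapping, the cosine ranking, and the toy vectors are all assumptions made for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def cluster_negatives(query_vec, gold_id, doc_vecs, cluster_of, k):
    """Hypothetical 'localized' pool: non-gold papers from the gold paper's cluster."""
    pool = [d for d in doc_vecs
            if d != gold_id and cluster_of[d] == cluster_of[gold_id]]
    pool.sort(key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)
    return pool[:k]

def broad_negatives(query_vec, gold_id, doc_vecs, k):
    """Hypothetical 'broad' pool: the most similar non-gold papers in the whole corpus."""
    pool = [d for d in doc_vecs if d != gold_id]
    pool.sort(key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)
    return pool[:k]

# Toy corpus (assumed data): p1 is the gold paper for the claim.
doc_vecs = {
    "p1": [1.0, 0.0],
    "p2": [0.9, 0.1],
    "p3": [0.6, 0.4],
    "p4": [0.95, 0.05],
    "p5": [0.0, 1.0],
}
cluster_of = {"p1": 0, "p2": 0, "p3": 0, "p4": 1, "p5": 1}
claim_vec = [1.0, 0.0]

local_pool = cluster_negatives(claim_vec, "p1", doc_vecs, cluster_of, k=2)
broad_pool = broad_negatives(claim_vec, "p1", doc_vecs, k=2)
print(local_pool, broad_pool)
```

In this toy setup the broad pool picks up `p4`, a close neighbor of the claim that sits in a different cluster, while the localized pool stays inside the gold paper's cluster. That is the kind of difference a stage-aware design would exploit when choosing negatives for the retriever versus the reranker.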

The team also studied multiple LLM-based evidence-selection formulations (direct classification, pairwise comparison, and listwise reranking prompts) and found that constrained classification prompts deliver the most reliable final document selection. Their final system combines a dense retriever, a multilingual cross-encoder reranker, and a selective LLM-based disagreement resolver, and it ranked 6th among 37 submissions in the shared task evaluation. Overall, the results suggest that hard-negative mining should be treated as a stage-aware design problem rather than a single optimization strategy. The work has practical implications for fact-checking, especially as misinformation increasingly spreads across languages.
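One way to read "selective disagreement resolver" is sketched below in Python. This is an assumption-laden illustration, not the authors' system: the summary does not define when the LLM is invoked, so the sketch assumes it is called only when the retriever and reranker disagree on the top document. The `resolve` function, the `llm_choose` callable, and the `n_candidates` parameter are hypothetical names, and the LLM is replaced by a stub.

```python
from typing import Callable, List, Sequence

def resolve(
    claim: str,
    retriever_ranked: Sequence[str],
    reranker_ranked: Sequence[str],
    llm_choose: Callable[[str, List[str]], str],
    n_candidates: int = 3,
) -> str:
    """Return the final document id for a claim.

    Assumed policy: trust the reranker when both stages agree on the top
    document; otherwise ask the LLM to pick exactly one id from a short,
    fixed candidate list (a constrained classification prompt).
    """
    if retriever_ranked[0] == reranker_ranked[0]:
        return reranker_ranked[0]  # agreement: no LLM call needed

    # Merge the top candidates of both stages, dropping duplicates in order.
    merged = list(reranker_ranked[:n_candidates]) + list(retriever_ranked[:n_candidates])
    candidates = list(dict.fromkeys(merged))

    answer = llm_choose(claim, candidates)
    # Enforce the constraint: any answer outside the list falls back to the reranker.
    return answer if answer in candidates else reranker_ranked[0]

def stub_llm(claim: str, candidates: List[str]) -> str:
    """Stand-in for a real LLM call; always picks the last candidate."""
    return candidates[-1]

final_doc = resolve("example claim", ["a", "b"], ["b", "a"], stub_llm, n_candidates=2)
print(final_doc)
```

Restricting the LLM's answer to the candidate list and falling back to the reranker on any other output keeps a malformed or off-list response from changing the final selection.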

Key Points
  • Proposes cluster-aware hard-negative mining that adapts to different stages (retrieval vs. reranking) in scientific-source retrieval pipelines.
  • Localized cluster negatives favor precision; broader non-gold semantic negatives improve candidate coverage and multilingual reranking consistency.
  • System combines a dense retriever, a multilingual cross-encoder reranker, and a selective LLM-based disagreement resolver, ranking 6th among 37 submissions.

Why It Matters

Better matching of multilingual social media claims to scientific publications supports more reliable fact-checking as misinformation spreads across languages.