OpenSanctions Pairs: Large-Scale Entity Matching with LLMs
A new 755,540-pair benchmark shows that LLMs such as GPT-4o can outperform rule-based systems by 7.6 percentage points in F1 for sanctions screening.
A team from OpenSanctions has released a major new benchmark for evaluating AI in compliance and financial intelligence. The 'OpenSanctions Pairs' dataset contains 755,540 labeled entity pairs drawn from real-world international sanctions aggregation, spanning 293 sources across 31 countries. It presents the messy, multilingual reality of compliance work, with cross-script names, noisy attributes, and missing data, providing a rigorous testbed far beyond clean academic datasets.
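To illustrate the kind of messiness involved, here is a minimal normalization sketch (not from the benchmark itself) showing why naive string comparison fails on this data: case and diacritic noise can be collapsed with standard Unicode handling, but cross-script name variants cannot.

```python
import unicodedata

def normalize_name(name: str) -> str:
    """Lowercase, strip diacritics, and collapse whitespace so that
    superficially different spellings compare equal."""
    # NFKD decomposition separates base characters from combining marks
    decomposed = unicodedata.normalize("NFKD", name.casefold())
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(stripped.split())

# Diacritic and case noise collapses to one canonical form:
print(normalize_name("José  MARTÍNEZ"))  # -> jose martinez
# But a Cyrillic spelling and its Latin transliteration stay distinct,
# which is exactly where simple rules run out of road:
print(normalize_name("Иванов") == normalize_name("Ivanov"))  # -> False
```

Handling the second case requires transliteration models or learned matching, which is the gap the benchmark is designed to probe.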
Benchmarking results reveal a significant performance leap for modern LLMs. A production-grade, rule-based matching algorithm (the nomenklatura RegressionV1) scored 91.33% F1. In contrast, off-the-shelf large language models achieved near-ceiling performance: OpenAI's GPT-4o hit 98.95% F1, while a locally deployable open model, DeepSeek-R1-Distill-Qwen-14B, reached 98.23%. Interestingly, advanced prompt optimization with DSPy MIPROv2 provided only modest gains, and adding in-context examples often hurt performance, suggesting the models' inherent reasoning is highly effective.
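The LLM setup reduces matching to a pairwise yes/no question. The sketch below shows the general shape of such a pipeline; the field names, prompt wording, and MATCH/NO_MATCH labels are illustrative assumptions, not the benchmark's actual prompt.

```python
def build_match_prompt(a: dict, b: dict) -> str:
    """Render two entity records into a yes/no matching question.
    Field set and wording are illustrative only."""
    def render(e: dict) -> str:
        return "; ".join(f"{k}: {v}" for k, v in sorted(e.items()) if v)
    return (
        "Do these two records refer to the same real-world entity?\n"
        f"Record A: {render(a)}\n"
        f"Record B: {render(b)}\n"
        "Answer MATCH or NO_MATCH."
    )

def parse_decision(reply: str) -> bool:
    """Map a free-text model reply onto a boolean match decision."""
    return reply.strip().upper().startswith("MATCH")

a = {"name": "Ivan Petrov", "birthDate": "1969-04-01", "country": "ru"}
b = {"name": "Petrov, Ivan", "birthDate": "1969-04-01", "country": "ru"}
prompt = build_match_prompt(a, b)  # sent to the model of your choice
```

The prompt string would be sent to GPT-4o or a local model such as DeepSeek-R1-Distill-Qwen-14B, with `parse_decision` applied to the reply.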
The error analysis uncovered complementary failure modes. The traditional rule-based system tended to over-match, creating false positives, while LLMs primarily stumbled on complex cross-script transliterations and minor inconsistencies in identifiers or dates. The researchers conclude that pairwise matching accuracy is now so high that the field's effort should pivot to other pipeline challenges, such as efficient 'blocking' to reduce comparison pairs and 'clustering' matched entities, which are now the greater bottlenecks to fully automated, reliable sanctions screening.
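Blocking, one of the bottlenecks named above, means cheaply pruning the n*(n-1)/2 candidate space before any expensive pairwise comparison. A deliberately crude sketch of the idea, using the first name token as the blocking key (real blockers use phonetic codes or n-gram indexes):

```python
from collections import defaultdict
from itertools import combinations

def blocking_key(entity: dict) -> str:
    """Crude illustrative key: first token of the lowercased name."""
    return entity["name"].lower().split()[0]

def candidate_pairs(entities: list[dict]):
    """Yield only pairs that share a blocking key, instead of
    comparing all n*(n-1)/2 combinations."""
    blocks = defaultdict(list)
    for e in entities:
        blocks[blocking_key(e)].append(e)
    for block in blocks.values():
        yield from combinations(block, 2)

people = [
    {"name": "Ivan Petrov"},
    {"name": "Ivan Sidorov"},
    {"name": "Maria Petrova"},
]
pairs = list(candidate_pairs(people))
# Only the two "ivan" records are compared: 1 pair instead of 3.
```

The trade-off is recall: a key that is too coarse floods the matcher with pairs, while one that is too fine silently drops true matches, which is why the authors flag blocking as a research problem in its own right.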
- The OpenSanctions Pairs benchmark contains 755,540 real-world entity pairs from 293 international sanctions sources, creating a robust test for messy, multilingual data.
- GPT-4o achieved a 98.95% F1 score, outperforming a production rule-based system (91.33% F1) by 7.6 percentage points, with a local open model (DeepSeek-R1-Distill-Qwen-14B) reaching 98.23%.
- The study shows pairwise matching is nearing a solved problem, directing future R&D toward pipeline stages like blocking and clustering for full workflow automation.
Why It Matters
This validates LLMs for high-stakes compliance, potentially reducing manual review in sanctions screening and anti-money laundering by automating the most complex matching tasks.