FinRetrieval: A Benchmark for Financial Data Retrieval by AI Agents
Claude Opus hits 90.8% accuracy with structured APIs but plummets to 19.8% with web search alone.
Researchers Eric Y. Kim and Jie Huang have introduced FinRetrieval, a benchmark that evaluates how effectively AI agents retrieve specific numeric values from structured financial databases. It consists of 500 financial retrieval questions with ground-truth answers, agent responses from 14 configurations across three frontier providers (Anthropic, OpenAI, Google), and complete tool call execution traces. The evaluation shows that tool availability is the dominant factor in performance: Claude Opus reaches 90.8% accuracy with structured data APIs but drops to 19.8% with web search alone. That 71 percentage point gap is three to four times the gap reported for the other providers. The findings highlight how heavily financial AI applications depend on proper data infrastructure.
The study also examines how reasoning mode affects different models. The researchers found that its benefit varies inversely with base capability: reasoning mode adds 9.0 percentage points for OpenAI models but only 2.8pp for Claude Opus. They attribute this difference to how the models use tools in base mode rather than to inherent reasoning ability. A geographic gap also emerged, with a 5.6 percentage point advantage on US-based queries, which the authors explain by fiscal year naming conventions rather than model limitations. The team has released the complete dataset, evaluation code, and tool traces to support further research on financial AI systems and to provide a standardized way to measure progress in this domain.
- Claude Opus shows 90.8% accuracy with structured APIs vs 19.8% with web search alone (71pp gap)
- Reasoning mode provides +9.0pp boost for OpenAI models vs +2.8pp for Claude Opus
- 5.6pp US advantage stems from fiscal year naming conventions, not model limitations
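
The headline gap above is a difference in percentage points, not a relative change. A minimal sketch of the arithmetic follows. It assumes every configuration is scored on all 500 questions, so the correct-answer counts (454 and 99) are back-calculated from the reported percentages rather than taken from the dataset. The function names are hypothetical and are not part of the released evaluation code.

```python
def accuracy_pct(correct: int, total: int) -> float:
    """Return accuracy as a percentage of questions answered correctly."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * correct / total


def gap_pp(a_pct: float, b_pct: float) -> float:
    """Return the difference between two accuracies in percentage points."""
    return a_pct - b_pct


TOTAL_QUESTIONS = 500  # benchmark size reported in the article

# Assumption: counts inferred from the reported 90.8% and 19.8%,
# not read from the released dataset.
api_correct = 454
web_correct = 99

api_acc = accuracy_pct(api_correct, TOTAL_QUESTIONS)
web_acc = accuracy_pct(web_correct, TOTAL_QUESTIONS)
gap = gap_pp(api_acc, web_acc)

print(f"structured APIs: {api_acc:.1f}%")
print(f"web search only: {web_acc:.1f}%")
print(f"gap: {gap:.1f}pp")
```

The same subtraction applies to the reasoning-mode and US-versus-non-US comparisons, which the article also reports in percentage points.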
Why It Matters
The results point to a critical infrastructure dependency for financial AI: for accurate retrieval, structured data access matters more than model selection.