Research & Papers

NumLeak reveals LLMs memorize numeric benchmarks, inflating scores

arXiv cs.LG June 01, 2026

⚡ Top models recall Fama-French returns with r=0.99: memorization, not reasoning

Deep Dive

Researchers presenting at the MemFM workshop at ICML 2026 report that foundation models reproduce public numeric benchmarks memorized from their pretraining data rather than demonstrating out-of-sample reasoning. The NumLeak framework combines API probes on production models with white-box validation on open causal LMs. Top-tier models recalled the Fama-French market excess return, U.S. unemployment, CPI inflation, and NOAA temperature series with Pearson correlations of 0.97–0.99. The summary also reports a within-25bps figure of 0.15 on sibling factors.
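The two statistics quoted above, Pearson r and a within-25bps hit rate, can be sketched in a few lines. This is a minimal illustration, not NumLeak's own scoring code: the function names, the assumption that values are expressed in percent (so 25 bps = 0.25 percentage points), and the sample numbers are all hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(vx * vy)


def within_bps(recalled, truth, bps=25):
    """Fraction of recalled values within `bps` basis points of the truth.

    Assumes both series are in percent, so 1 bp = 0.01 percentage points.
    """
    tol = bps / 100.0
    hits = sum(1 for r, t in zip(recalled, truth) if abs(r - t) <= tol)
    return hits / len(truth)


# Hypothetical monthly values in percent (illustration only, not real data).
truth = [1.20, -0.50, 2.10, 0.30, -1.75]
recalled = [1.18, -0.45, 2.30, 0.31, -1.20]

r = pearson_r(recalled, truth)
hit_rate = within_bps(recalled, truth)
print(f"r={r:.3f}, within-25bps={hit_rate:.2f}")
```

The two numbers answer different questions: r measures whether recalled values track the true series, while the hit rate measures how often an individual value lands close to the truth, so a series can score high on one and low on the other.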

On a recent holdout period not seen in training, the models' parse rates (the share of replies from which a number could be extracted) collapsed to 21–57%. For the months they did answer, however, r stayed at about 0.99, which the authors take as confirmation of a memorized channel. The white-box experiment reproduced a dose-response logprob ranking that open-ended generation misses, showing that closed-API probes understate the problem. A Sonnet model's date-to-market-sentiment regression dropped from r=0.74 to r=0.02 once the model's own recall was residualized out. Crucially, a simple one-line system-prompt defense blocked 99.8% of single-turn suffix attacks with near-zero utility cost on factual and historical queries.
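The holdout check and the residualization step can be sketched as follows. This is a rough illustration under stated assumptions, not the paper's pipeline: it assumes the probe asks for a bare number (optionally followed by `%`), treats any other reply as unparsed, and reads "residualized out" as a single-regressor ordinary least squares fit. All function names and sample replies are hypothetical.

```python
import math
import re

# Assumption: a reply counts as parsed only if it is a bare signed decimal.
NUMBER = re.compile(r"\s*([-+]?\d+(?:\.\d+)?)\s*%?\s*")


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def parse_number(reply):
    """Return the reply as a float, or None for refusals and free text."""
    m = NUMBER.fullmatch(reply)
    return float(m.group(1)) if m else None


def holdout_summary(replies, truth):
    """Return (parse rate, Pearson r computed on the answered months only)."""
    pairs = [(parse_number(rep), t) for rep, t in zip(replies, truth)]
    answered = [(p, t) for p, t in pairs if p is not None]
    rate = len(answered) / len(replies)
    r = None
    if len(answered) >= 2:
        r = pearson_r([p for p, _ in answered], [t for _, t in answered])
    return rate, r


def residualize(y, x):
    """Residuals of y after an OLS fit on x with an intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    beta = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum(
        (a - mx) ** 2 for a in x
    )
    return [b - (my + beta * (a - mx)) for a, b in zip(x, y)]


# Hypothetical replies for five holdout months (illustration only).
replies = ["1.1", "I don't know", "-0.4%", "no data for 2025-03", "2.0"]
truth = [1.0, 0.2, -0.5, 0.7, 2.1]
rate, r = holdout_summary(replies, truth)
print(f"parse rate={rate:.0%}, r on answered months={r:.3f}")
```

To mimic the residualization result, one would correlate `residualize(sentiment, recall)` with the true series: if the sentiment signal carries no information beyond the model's own recalled number, that correlation should fall toward zero.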

Key Points

NumLeak detected Pearson r=0.97-0.99 on financial and economic benchmarks across frontier LLMs
On a recent holdout, parse rates fell to 21-57%, but answered months retained r~0.99, pointing to memorization rather than reasoning
A one-line system-prompt defense blocked 99.8% of attack suffixes without degrading performance on conceptual queries

Why It Matters

Benchmark scores may be inflated when LLMs recall numbers instead of reasoning, which threatens evaluation integrity in finance, policy, and science.
