NumLeak reveals LLMs memorize numeric benchmarks, inflating scores
Top models recall Fama-French returns with r=0.99, which points to memorization rather than reasoning
Researchers presenting at ICML 2026's MemFM workshop report that foundation models memorize public numeric benchmarks from their pretraining data rather than demonstrating out-of-sample reasoning. The NumLeak framework combines API probes on production models with white-box validation on open causal LMs. Top-tier models achieved Pearson correlations of 0.97–0.99 against the Fama-French market excess return, U.S. unemployment, CPI inflation, and NOAA temperature series; on sibling factors, the reported within-25bps figure was 0.15.
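The two metrics named above can be illustrated with a minimal sketch. It assumes that "within-25bps" means the fraction of recalled values landing within 25 basis points of the true value, which is one reading of the summary rather than NumLeak's confirmed definition. The function names and the sample numbers are hypothetical and are not real benchmark data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / math.sqrt(sxx * syy)


def within_bps_rate(recalled, actual, tol_bps=25.0):
    """Fraction of recalled values within tol_bps of the truth.

    Assumes both series are in percent, so 25 bps equals 0.25 percentage points.
    """
    if len(recalled) != len(actual) or not actual:
        raise ValueError("need two equal-length, non-empty sequences")
    tol = tol_bps / 100.0
    hits = sum(1 for r, a in zip(recalled, actual) if abs(r - a) <= tol)
    return hits / len(actual)


# Made-up monthly values in percent (NOT real Fama-French data).
actual_pct = [1.0, -2.0, 3.0, 0.5]
recalled_pct = [1.1, -1.9, 2.6, 0.5]

r = pearson_r(recalled_pct, actual_pct)
hit_rate = within_bps_rate(recalled_pct, actual_pct)
print(f"r={r:.3f}, within-25bps rate={hit_rate:.2f}")
```

A high r alongside a low hit rate would indicate that a model tracks the shape of a series without reproducing its exact values, which is why the two numbers are worth reporting together.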
When tested on a recent holdout not seen in training, the models' parse rates collapsed to 21–57%, but for the months they did answer, r stayed at about 0.99, which points to a memorized channel. The white-box experiment reproduced a dose-response logprob ranking that open-ended generation misses, showing that closed-API probes understate the problem. A Sonnet model's date-to-market-sentiment regression dropped from r=0.74 to r=0.02 once the model's own recall was residualized out. A simple one-line system-prompt defense blocked 99.8% of single-turn suffix attacks with near-zero utility cost on factual and historical queries.
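The parse-rate and residualization steps described above can be sketched as follows. This is an illustration under stated assumptions, not NumLeak's actual procedure: it assumes an unparseable answer is represented as `None`, and that "residualizing out recall" means regressing the sentiment score on the model's recalled value by ordinary least squares and keeping the residuals. All names and numbers are hypothetical and synthetic.

```python
import math


def parse_rate(responses):
    """Share of responses that yielded a usable number (None = unparsed)."""
    if not responses:
        raise ValueError("no responses")
    return sum(1 for r in responses if r is not None) / len(responses)


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def residualize(y, x):
    """Residuals of a one-variable OLS fit of y on x (with intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("cannot residualize on a constant series")
    beta = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    alpha = mean_y - beta * mean_x
    return [yi - (alpha + beta * xi) for xi, yi in zip(x, y)]


# Synthetic example: the model recalls the true return perfectly, and its
# "sentiment" score is mostly a rescaled copy of that recalled number.
actual = [1.0, -2.0, 3.0, 0.5, -1.0]
recalled = [1.0, -2.0, 3.0, 0.5, -1.0]
sentiment = [2.1, -4.1, 6.1, 0.9, -2.0]

r_before = pearson_r(sentiment, actual)
r_after = pearson_r(residualize(sentiment, recalled), actual)
holdout_parse_rate = parse_rate([0.4, None, None, 1.2])
```

In this toy setup the sentiment score correlates strongly with the true return before residualization and essentially not at all afterward, mirroring the direction of the reported drop from r=0.74 to r=0.02.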
- NumLeak detected Pearson r=0.97-0.99 on financial and economic benchmarks across frontier LLMs
- On a recent holdout, parse rates fell to 21-57%, but answered months retained r~0.99, indicating memorization rather than reasoning
- A one-line system-prompt defense blocked 99.8% of attack suffixes without degrading performance on conceptual queries
Why It Matters
Benchmark scores may be inflated because LLMs recall numbers instead of reasoning about them, which threatens evaluation integrity in finance, policy, and science.