JP Morgan's new benchmark reveals LLMs collapse without memory by turn 3

arXiv cs.CL May 27, 2026

⚡ JP Morgan's EnterpriseMem-Bench challenges the assumption that larger context windows solve memory retention in LLMs: every tested model, run without conversation memory, hits 0% execution accuracy by the third conversation turn.

Deep Dive

A new study from JP Morgan Chase's LLM Suite Engineering Team reveals stark limitations in current frontier models on multi-turn Text-to-SQL tasks. The researchers built EnterpriseMem-Bench, a benchmark of 300 sessions and 1,400 turns across three enterprise domains (BIRD financial, SEC EDGAR, Northwind), with per-turn memory-critical annotations. They evaluated GPT-5 mini, GPT-5.2, Claude Sonnet 4.5, Claude Sonnet 4.6, and Claude Opus 4.6 under five memory conditions that isolate working-memory window size, episodic retrieval, and semantic augmentation. The most alarming finding: in the stateless condition, multi-turn Text-to-SQL execution accuracy collapses to exactly 0% by Turn 3 for all five models, even with reasoning enabled. Adding memory-architecture complexity did not monotonically improve accuracy. Working memory dominated, while additional components produced model- and dataset-dependent effects ranging from +14 to -16 percentage points.
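To make the stateless and working-memory conditions concrete, here is a minimal sketch of how an evaluation harness might assemble the model's context on each turn. The `Turn` type, the `build_context` function, the `window` parameter, and the example questions are all hypothetical and are not taken from the paper's released code; `window=0` stands in for the stateless condition.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    question: str  # the user's natural-language request
    sql: str       # the SQL produced for that request


def build_context(history: list[Turn], question: str, window: int) -> str:
    """Assemble the prompt text for the current turn.

    `window` is the number of prior turns kept (an assumed knob);
    window=0 models the stateless condition.
    """
    # Guard window=0 explicitly: history[-0:] would return the whole list.
    kept = history[-window:] if window > 0 else []
    lines = []
    for turn in kept:
        lines.append(f"User: {turn.question}")
        lines.append(f"SQL: {turn.sql}")
    lines.append(f"User: {question}")
    return "\n".join(lines)


# Hypothetical three-turn session; table and column names are illustrative.
history = [
    Turn("Total revenue by customer",
         "SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id"),
    Turn("Only for 2023",
         "SELECT customer_id, SUM(amount) FROM orders "
         "WHERE year = 2023 GROUP BY customer_id"),
]
follow_up = "Now show just the top five"

stateless = build_context(history, follow_up, window=0)
windowed = build_context(history, follow_up, window=2)
```

In the stateless prompt the model sees only "Now show just the top five", with no table, filter, or aggregation to refer back to. That is one plausible reading of why accuracy reaches 0% by Turn 3 regardless of model quality.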

Another surprising result: Claude Sonnet 4.6 underperformed Sonnet 4.5 by 17 to 33 percentage points on the SEC EDGAR dataset across all memory conditions, a generational regression that persisted with reasoning enabled. In addition, with reasoning enabled the Claude models showed a unimodal error distribution: every incorrect turn was a wrong-result error, with no partial or syntax errors. The paper introduces the Memory Benefit Score (MBS) as a per-turn diagnostic and releases the benchmark, agent, and evaluation code. These findings have serious implications for enterprise analytics applications that rely on conversational database querying. They suggest that current architectures need robust memory integration to avoid catastrophic failure in extended interactions.
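The error categories can be illustrated with a small execution-based check. The sketch below uses Python's standard `sqlite3` module and a toy table. The `classify` function and its rules are assumptions made for illustration: a query that fails to execute is labelled a syntax error, and one that runs but returns different rows is labelled a wrong-result error. The "partial" category is omitted because this summary does not define it, and the paper's own evaluation code may differ.

```python
import sqlite3
from collections import Counter


def classify(conn: sqlite3.Connection, predicted_sql: str, gold_sql: str) -> str:
    """Return 'correct', 'wrong_result', or 'syntax_error' for one turn."""
    gold_rows = conn.execute(gold_sql).fetchall()
    try:
        pred_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        # Assumption: any execution failure is counted as a syntax error.
        return "syntax_error"
    # Assumption: compare rows as multisets, so row order is ignored.
    if Counter(pred_rows) == Counter(gold_rows):
        return "correct"
    return "wrong_result"


# Toy in-memory database; the schema is illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, year INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 2022, 10.0), (2, 2023, 20.0), (3, 2023, 5.0)],
)

gold = "SELECT SUM(amount) FROM orders WHERE year = 2023"
labels = [
    classify(conn, "SELECT SUM(amount) FROM orders WHERE year = 2023", gold),
    classify(conn, "SELECT SUM(amount) FROM orders", gold),  # lost the 2023 filter
    classify(conn, "SELEC SUM(amount) FROM orders", gold),   # malformed keyword
]
```

The second query is the kind of failure the article describes: valid SQL that silently drops a constraint set in an earlier turn, so it executes cleanly and returns the wrong answer.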

Key Points

All five tested models reach 0% stateless execution accuracy by Turn 3 on multi-turn Text-to-SQL, indicating that model capability alone does not make up for missing conversational memory.
Claude Sonnet 4.6 regressed 17 to 33 percentage points on SEC EDGAR data versus Sonnet 4.5, showing that a newer model can perform worse on multi-turn tasks than its predecessor.
Enterprise adoption of LLMs for database queries will likely require explicit memory augmentation (RAG, vector stores, stateful agents), accelerating a shift toward hybrid architectures.

Why It Matters

JP Morgan's benchmark suggests that conversational database querying depends on explicit memory rather than on larger context windows alone, which pushes enterprise AI design toward stateful architectures.
