Research & Papers

JP Morgan's new benchmark reveals LLMs collapse without memory by turn 3

JP Morgan’s EnterpriseMem-Bench challenges the prevailing assumption that larger context windows solve memory retention in LLMs: every tested model falls to 0% accuracy by the third conversation turn when run without external memory.

Deep Dive

JP Morgan’s EnterpriseMem-Bench, a 300-session multi-turn Text-to-SQL benchmark, exposes a stark limitation in today’s large language models: under stateless execution, accuracy collapses completely by turn 3. GPT-5 mini, GPT-5.2, Claude Sonnet 4.5, Sonnet 4.6, and Opus 4.6 all score 0% on the third turn when no external memory is provided. This is not gradual decay but an abrupt failure, and it should concern any enterprise deploying LLMs for multi-step decision-making or data retrieval. The benchmark simulates realistic enterprise queries over financial data, where a single session may require several sequential operations. Without caching, retrieval augmentation, or fine-tuned prompts, the models effectively lose the task context after just two exchanges.
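To make the metric concrete, here is a minimal sketch of how execution accuracy by turn could be computed: run the gold query and the predicted query against the same database and count a hit only when the result sets match. This is an illustration, not the benchmark's actual harness. The `filings` table, the sample queries, and the function names `rows` and `execution_accuracy_by_turn` are all hypothetical.

```python
import sqlite3
from collections import defaultdict


def rows(conn, sql):
    """Run a query and return its rows in a stable order, or None if it fails."""
    try:
        return sorted(conn.execute(sql).fetchall(), key=repr)
    except sqlite3.Error:
        return None


def execution_accuracy_by_turn(conn, records):
    """records: iterable of (turn, gold_sql, predicted_sql) tuples."""
    hits, totals = defaultdict(int), defaultdict(int)
    for turn, gold_sql, predicted_sql in records:
        totals[turn] += 1
        gold = rows(conn, gold_sql)
        predicted = rows(conn, predicted_sql)
        # A prediction counts only if it runs and returns the gold result set.
        if gold is not None and predicted == gold:
            hits[turn] += 1
    return {turn: hits[turn] / totals[turn] for turn in sorted(totals)}


# Hypothetical toy database standing in for enterprise financial data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE filings (company TEXT, year INTEGER, revenue REAL);
INSERT INTO filings VALUES
    ('Acme', 2022, 10.0), ('Acme', 2023, 12.0), ('Globex', 2023, 7.0);
""")

# One hypothetical session: the turn-3 prediction has lost the earlier filters.
records = [
    (1, "SELECT company FROM filings WHERE year = 2023",
        "SELECT company FROM filings WHERE year = 2023"),
    (2, "SELECT company FROM filings WHERE year = 2023 AND revenue > 10",
        "SELECT company FROM filings WHERE year = 2023 AND revenue > 10"),
    (3, "SELECT SUM(revenue) FROM filings WHERE year = 2023 AND revenue > 10",
        "SELECT SUM(revenue) FROM filings"),
]

accuracy = execution_accuracy_by_turn(conn, records)
print(accuracy)  # {1: 1.0, 2: 1.0, 3: 0.0}
```

In this toy session the turn-3 query is syntactically valid but has dropped the year and revenue filters from earlier turns, which is the kind of error a per-turn execution check surfaces.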

Why do all models fail uniformly, despite large differences in architecture and training? The answer lies in the fundamental design of transformers: self-attention can process long sequences, but it cannot maintain a dedicated working memory across turns without explicit state management. Earlier benchmarks such as Spider (2018), CoSQL (2019), SParC (2019), and BIRD (2023) focused on single-turn or turn-to-turn accuracy but never isolated memory decay over multiple turns in a controlled enterprise setting. EnterpriseMem-Bench fills that gap, and the result is damning. Even Claude Sonnet 4.6, a later version, regressed 17 to 33 percentage points compared to Sonnet 4.5 on SEC EDGAR data, suggesting that a newer release does not guarantee better multi-turn retention and may even harm it. This points to a structural ceiling independent of model size.

The implications for enterprise adoption are significant. Salesforce, Google DeepMind, and Microsoft all offer LLM-powered SQL copilots that promise to streamline analytics. Salesforce relies on retrieval-augmented generation (RAG) within its CRM workflows; Microsoft’s Azure SQL Copilot uses vector databases to store user-specific context. These external memory layers can lift accuracy well above the 0% baseline, but they are bolted-on solutions that introduce latency, cost, and complexity.

JP Morgan’s benchmark, conducted under stateless conditions (no RAG, no caching), reveals the raw LLM deficit. It suggests that the enterprise market’s current trajectory of throwing larger context windows at the problem is misguided. The global AI database market is projected to exceed $10 billion by 2028, and this finding is likely to accelerate investment in architectures that bake memory into the model itself, such as stateful agents, memory-augmented transformers, or hybrid retrieval-generation systems. For regulated industries like finance, where query accuracy is non-negotiable, the message is clear: do not deploy stateless LLMs for critical multi-turn tasks without a robust memory layer. On this reading, the three-turn ceiling is not a bug to be patched but a fundamental architectural constraint that demands a new paradigm.
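As a rough illustration of what a memory layer adds, the sketch below builds the prompt for a follow-up question twice: once statelessly and once with earlier turns replayed from a session store. It assumes a simple in-process store and does not reflect how the benchmark or any vendor product works. `SessionMemory`, `build_prompt`, the schema, and the sample questions are hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass
class SessionMemory:
    """Hypothetical in-process store of (question, sql) pairs for one session."""
    turns: list = field(default_factory=list)

    def remember(self, question, sql):
        self.turns.append((question, sql))

    def render(self):
        return "\n".join(
            f"Q{i}: {question}\nSQL{i}: {sql}"
            for i, (question, sql) in enumerate(self.turns, start=1)
        )


def build_prompt(schema, question, memory=None):
    """Assemble a Text-to-SQL prompt; earlier turns appear only if memory is given."""
    parts = [f"Schema:\n{schema}"]
    if memory is not None and memory.turns:
        parts.append(f"Earlier turns:\n{memory.render()}")
    parts.append(f"Question: {question}\nSQL:")
    return "\n\n".join(parts)


schema = "filings(company TEXT, year INTEGER, revenue REAL)"
memory = SessionMemory()
memory.remember(
    "Which companies filed in 2023?",
    "SELECT company FROM filings WHERE year = 2023",
)
memory.remember(
    "Only those with revenue above 10.",
    "SELECT company FROM filings WHERE year = 2023 AND revenue > 10",
)

follow_up = "What is their total revenue?"
stateless_prompt = build_prompt(schema, follow_up)
stateful_prompt = build_prompt(schema, follow_up, memory)

print(stateless_prompt)
print("-----")
print(stateful_prompt)
```

In the stateless prompt, "their" has no referent, so no model could recover the 2023 and revenue filters from it. The stateful prompt carries those filters forward, at the price of extra tokens and a store that has to be maintained per session.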

Key Points
  • All tested LLMs score 0% stateless execution accuracy by turn 3 on multi-turn SQL, suggesting that context window size alone cannot solve working-memory problems.
  • Claude Sonnet 4.6 regressed 17–33 percentage points on SEC EDGAR data versus its predecessor, showing that newer models may worsen multi-turn retention.
  • Enterprise adoption of LLMs for database queries will require explicit memory augmentation (RAG, vector stores, stateful agents), accelerating a shift toward hybrid architectures.

Why It Matters

JP Morgan’s benchmark reveals a fundamental LLM limitation that forces a rethinking of enterprise AI beyond larger context windows.