Research & Papers

[D] We audited LoCoMo: 6.4% of the answer key is wrong and the judge accepts up to 63% of intentionally wrong answers

Major long-term memory benchmark contains systematic errors and rewards vague, incorrect answers from AI systems.

Deep Dive

A systematic audit of the widely cited LoCoMo long-term memory benchmark has revealed fundamental flaws that undermine its reliability. Researchers found 99 score-corrupting errors among the benchmark's 1,540 questions, meaning 6.4% of the answer key is wrong. These errors include hallucinated facts absent from the source conversations, incorrect temporal reasoning (e.g., miscalculated dates), and speaker-attribution mistakes. Because these errors are baked into the answer key, even a perfect memory system could score only about 93.6% on LoCoMo, making fair comparisons between systems like Mem0, EverMemOS, and Zep impossible.
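The ceiling arithmetic is straightforward; a minimal sketch using the figures reported in the audit (the variable names are illustrative):

```python
# Numbers reported by the audit: 99 corrupted answer-key entries out of 1,540 questions.
TOTAL_QUESTIONS = 1540
CORRUPTED_ANSWERS = 99

# Fraction of the answer key that is wrong, and the resulting score ceiling:
# a system that answers every question correctly still "fails" the corrupted items.
error_rate = CORRUPTED_ANSWERS / TOTAL_QUESTIONS
score_ceiling = 1 - error_rate

print(f"answer-key error rate: {error_rate:.1%}")      # → 6.4%
print(f"effective score ceiling: {score_ceiling:.1%}")  # → 93.6%
```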

The problems extend beyond the answer key to the evaluation methodology itself. LoCoMo uses GPT-4o-mini as a judge to score system responses, but testing revealed this judge accepts 62.81% of intentionally wrong answers. While it catches specific factual errors about 89% of the time, it passes vague, topically-adjacent answers that miss all key details nearly two-thirds of the time. This failure mode directly rewards weak retrieval systems that locate the right conversation but extract nothing specific. Combined with the lack of a standardized evaluation pipeline across research teams, these issues mean published scores on LoCoMo cannot be trusted for comparing AI memory capabilities.
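A back-of-envelope model makes the incentive problem concrete. Using the two acceptance rates reported above (62.81% of vague wrong answers passed; specific factual errors caught ~89% of the time), a system that always returns a vague, detail-free answer still earns a substantial score. The function below is a hypothetical sketch, not part of the benchmark's actual pipeline:

```python
# Acceptance rates reported in the audit; the answer-mix scenario is illustrative.
FALSE_ACCEPT_VAGUE = 0.6281            # judge passes vague, detail-free answers
FALSE_ACCEPT_SPECIFIC_WRONG = 1 - 0.89  # judge catches specific factual errors ~89% of the time

def expected_judge_score(p_correct: float, p_vague: float) -> float:
    """Expected score under the lenient judge for a system that answers
    correctly with probability p_correct, vaguely (topically adjacent but
    missing all key details) with probability p_vague, and with a specific
    factual error otherwise."""
    p_specific_wrong = 1 - p_correct - p_vague
    return (p_correct * 1.0
            + p_vague * FALSE_ACCEPT_VAGUE
            + p_specific_wrong * FALSE_ACCEPT_SPECIFIC_WRONG)

# A weak retriever that finds the right conversation but extracts nothing
# specific (every answer vague, none actually correct) still scores ~63%:
print(f"{expected_judge_score(p_correct=0.0, p_vague=1.0):.1%}")  # → 62.8%
```

This is why the judge's leniency rewards vagueness: the same system, graded against a strict exact-match criterion, would score 0%.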

Meanwhile, the frequently cited alternative benchmark LongMemEval-S faces a different but equally fundamental problem: its 115K-token context per question fits entirely within modern model context windows, making it more of a context-window test than a true memory test. This leaves the field without reliable benchmarks for measuring long-term memory in AI systems, and multiple researchers report being unable to reproduce published results.

Key Points
  • 6.4% of LoCoMo's answer key contains errors (99 of 1,540 questions), including hallucinated facts and wrong date calculations
  • GPT-4o-mini judge accepts 63% of intentionally wrong answers, rewarding vague responses over precise recall
  • No standardized evaluation pipeline exists, making scores from Mem0, EverMemOS, and Zep incomparable

Why It Matters

Flawed benchmarks distort AI progress tracking, making it impossible to accurately compare memory systems or measure real capability gains.