[D] MemPalace claims 100% on LoCoMo and a "perfect score on LongMemEval." Its own BENCHMARKS.md documents why neither is meaningful.
Open-source project's own documentation reveals how its perfect scores on LoCoMo and LongMemEval were achieved.
The open-source MemPalace project launched with explosive claims of 100% on the LoCoMo benchmark and a "perfect score" on LongMemEval, quickly gaining over 7,000 GitHub stars and 1.5 million tweet views. The project's own documentation, however, shows these headline numbers are misleading. For LoCoMo, the 100% was achieved by setting top_k=50: each conversation contains only 19-32 sessions, so a top-50 retrieval returns the entire conversation and bypasses the embedding retrieval challenge entirely. The honest metrics in the same file show 60.3% recall@10 without reranking and 88.9% with hybrid scoring.
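To see why the top_k=50 setting trivializes the task, here is a minimal recall@k sketch (function names and session ids are illustrative, not from MemPalace's code): once k meets or exceeds the number of candidate sessions, any ranking at all scores 100%.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant items found in the top-k of a ranking."""
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / len(relevant_ids)

# A LoCoMo-style conversation has at most 32 sessions, so with k=50
# the "top-k" is simply the entire conversation.
sessions = [f"session_{i}" for i in range(32)]
worst_ranking = list(reversed(sessions))  # relevant item ranked dead last

print(recall_at_k(worst_ranking, ["session_0"], k=50))  # 1.0 regardless of ranking
print(recall_at_k(worst_ranking, ["session_0"], k=10))  # 0.0: ranking actually matters
```

With k capped below the corpus size, the embedding quality is what determines the score; with k=50 it cannot fail.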
For LongMemEval, MemPalace's "perfect score" rests on a fundamental category error. The published benchmark measures end-to-end QA, with GPT-4 judging generated answers; MemPalace performed only retrieval, checking whether a correct session ID appeared in the top-5 results, with no answers generated and no judge invoked. This recall_any@5 retrieval task is substantially easier than the full benchmark. The project's documentation acknowledges these methodological issues, but the viral launch tweet omitted the caveats, feeding ongoing debate over how AI memory systems are evaluated.
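The gap between the two protocols can be made concrete with a hedged sketch of a recall_any@k scorer (the function and variable names are assumptions for illustration, not MemPalace's or LongMemEval's actual harness):

```python
def recall_any_at_k(ranked_session_ids, gold_session_ids, k=5):
    """Binary: 1 if ANY gold session id appears in the top-k, else 0.
    No answer is generated and no judge is invoked."""
    return int(bool(set(ranked_session_ids[:k]) & set(gold_session_ids)))

# What the full LongMemEval protocol additionally requires (pseudo-steps):
#   answer = llm_generate(question, retrieved_sessions)   # answer generation
#   score  = gpt4_judge(question, answer, gold_answer)    # judged end-to-end QA
# recall_any@5 skips both, so a system can score "perfectly"
# without ever producing a single answer.

ranked = ["s7", "s2", "s9", "s1", "s5", "s3"]
print(recall_any_at_k(ranked, gold_session_ids=["s1", "s4"]))  # 1: s1 is in the top-5
print(recall_any_at_k(ranked, gold_session_ids=["s3"]))        # 0: s3 is ranked 6th
```

A single gold session landing anywhere in the top 5 counts as a full success, which is a far looser bar than a judge accepting a generated answer.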
- LoCoMo 100% achieved via top_k=50 bypass that retrieves entire conversations (19-32 sessions each)
- LongMemEval "perfect score" measures only retrieval recall without answer generation or GPT-4 judging
- Project's own BENCHMARKS.md documents these issues while viral launch tweet omitted caveats
Why It Matters
The episode underscores the need for transparent benchmarking in AI memory systems as projects chase viral metrics over meaningful progress.