Memory Benchmark Rankings Can Flip Based on Scoring Target Choice
A single benchmark design choice silently changes which memory system appears best.
A new paper from Sugam Panthi and Rabab Abdelfattah reveals a hidden fragility in how we evaluate conversational memory systems for large language models (LLMs). These systems transform dialogue history into facts, summaries, timelines, and other derived forms, meaning a single original turn can coexist with several stored memories in the same retrieval index. The problem arises when deciding which version should receive retrieval credit during evaluation — a choice the authors call the scoring target.
Using their TIAP (Target-Invariant Audit Protocol), the researchers rescored saved ranked outputs under three distinct targets, without rerunning retrieval: Raw (the original turn), Source (the original turn together with the memories linked to it), and Canonical (derived forms only). On two popular benchmarks, LoCoMo and LongMemEval-S, switching only the credited target changed normalized discounted cumulative gain (nDCG) on 83.4% to 94.0% of shared queries. This shift flipped system orderings on Mem0 and MemoryOS transfer runs and even reversed recommendations about parser density. A separate 1,902-case semantic audit further found that relaxed source-linked credit was fully justified only 29.2% of the time, even though human rubric reliability was high. The bottom line: conclusions about which memory architecture performs best are not target-invariant. They silently depend on a single, often implicit, benchmark-design choice.
- TIAP audit rescored outputs under three targets (Raw, Source, Canonical) on LoCoMo and LongMemEval-S benchmarks.
- Switching targets changed nDCG on 83.4–94.0% of shared queries and flipped system orderings on Mem0 and MemoryOS transfer runs.
- Only 29.2% of relaxed source-linked credit was justified in a 1,902-case semantic audit.
- Parser-density recommendations reversed depending on chosen scoring target.
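To make the rescoring idea concrete, here is a minimal Python sketch of how one saved ranked list could be scored under the three targets without rerunning retrieval. It is not the paper's TIAP implementation. The `Item` record, the `relevance` rules, and the toy data are all assumptions made for illustration, and the target definitions follow one plausible reading of the description above. nDCG is simplified to binary gains, with the ideal ordering taken from the retrieved list itself.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """Hypothetical record for one retrieved entry in a saved ranking."""
    item_id: str
    kind: str          # "raw" (original turn) or "derived" (fact, summary, ...)
    source_turn: str   # id of the dialogue turn this entry came from


def relevance(item: Item, gold_turns: set, target: str) -> int:
    """Binary credit for one item under an assumed scoring-target rule."""
    linked = item.source_turn in gold_turns
    if target == "raw":        # credit only the original turn itself
        return int(linked and item.kind == "raw")
    if target == "source":     # credit anything linked to the gold turn
        return int(linked)
    if target == "canonical":  # credit only derived forms of the gold turn
        return int(linked and item.kind == "derived")
    raise ValueError(f"unknown target: {target}")


def ndcg(rels: list) -> float:
    """Binary-gain nDCG; the ideal ranking reorders the same retrieved list."""
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(rels))
    ideal = sum(r / math.log2(i + 2)
                for i, r in enumerate(sorted(rels, reverse=True)))
    return dcg / ideal if ideal > 0 else 0.0


def rescore(ranked: list, gold_turns: set) -> dict:
    """Score the same saved ranking under all three targets."""
    return {
        target: ndcg([relevance(item, gold_turns, target) for item in ranked])
        for target in ("raw", "source", "canonical")
    }


# Toy saved ranking: the gold turn t1 appears once raw and twice as derived memories.
ranked = [
    Item("fact-1", "derived", "t1"),
    Item("turn-1", "raw", "t1"),
    Item("turn-2", "raw", "t2"),
    Item("summary-1", "derived", "t1"),
]
scores = rescore(ranked, gold_turns={"t1"})
for target, value in scores.items():
    print(f"{target:>9}: nDCG = {value:.3f}")
```

In this toy case the same ranked list receives three different nDCG values, lowest under Raw and highest under Source, which is the kind of target dependence the audit measures at benchmark scale.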
Why It Matters
Benchmark rankings of memory systems can silently flip — researchers must now report scoring targets explicitly.