Memory Benchmark Rankings Can Flip Based on Scoring Target Choice

arXiv cs.IR May 26, 2026

⚡ A single benchmark design choice silently changes which memory system appears best.

Deep Dive

A new paper from Sugam Panthi and Rabab Abdelfattah reveals a hidden fragility in how we evaluate conversational memory systems for large language models (LLMs). These systems transform dialogue history into facts, summaries, timelines, and other derived forms, meaning a single original turn can coexist with several stored memories in the same retrieval index. The problem arises when deciding which version should receive retrieval credit during evaluation — a choice the authors call the scoring target.

Using their TIAP (Target-Invariant Audit Protocol), the researchers rescored saved ranked outputs under three distinct targets without rerunning retrieval: Raw (the original turn), Source (the original turn together with the memories linked to it), and Canonical (only derived forms). On two popular benchmarks, LoCoMo and LongMemEval-S, switching only the credited target changed normalized discounted cumulative gain (nDCG) on 83.4% to 94.0% of shared queries. This shift flipped system orderings on Mem0 and MemoryOS transfer runs and even reversed recommendations about parser density. A separate 1,902-case semantic audit further found that relaxed source-linked credit was fully justified only 29.2% of the time, even though human rubric reliability was high. The bottom line: conclusions about which memory architecture performs best are not target-invariant. They silently depend on a single, often implicit, benchmark-design choice.
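
To make the rescoring idea concrete, here is a minimal sketch in Python. It is not the authors' TIAP code: the item IDs, the two toy rankings, the binary-relevance nDCG, and the exact membership of each target set are all assumptions chosen for illustration. It only shows how the same saved rankings can be rescored under different credited targets without rerunning retrieval.

```python
import math


def ndcg_at_k(ranked_ids, credited_ids, k=10):
    """Binary-relevance nDCG@k of one saved ranking against a credited set."""
    dcg = sum(
        1.0 / math.log2(position + 2)
        for position, item in enumerate(ranked_ids[:k])
        if item in credited_ids
    )
    ideal_hits = min(len(credited_ids), k)
    idcg = sum(1.0 / math.log2(position + 2) for position in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def rescore(saved_rankings, targets, k=10):
    """Score every saved ranking under every target; retrieval is never rerun."""
    return {
        target_name: {
            system: ndcg_at_k(ranking, credited, k)
            for system, ranking in saved_rankings.items()
        }
        for target_name, credited in targets.items()
    }


# Hypothetical query: one original turn plus two memories derived from it.
raw_turn = {"turn_7"}
derived = {"fact_3", "summary_1"}

# Assumed reading of the three targets described above.
targets = {
    "raw": raw_turn,             # credit only the original turn
    "source": raw_turn | derived,  # credit the turn and its linked memories
    "canonical": derived,        # credit only derived forms
}

# Hypothetical saved outputs from two systems for the same query.
saved_rankings = {
    "system_a": ["turn_7", "noise_1", "noise_2", "fact_3"],
    "system_b": ["fact_3", "summary_1", "noise_1", "turn_7"],
}

scores = rescore(saved_rankings, targets)
for target_name, per_system in scores.items():
    best = max(per_system, key=per_system.get)
    print(target_name, {s: round(v, 3) for s, v in per_system.items()}, "best:", best)
```

In this toy data, system_a ranks first under the Raw target and system_b ranks first under the Canonical target, even though neither ranking changed. That is the kind of ordering flip the article describes, reproduced here at a scale of one query.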

Key Points

TIAP audit rescored outputs under three targets (Raw, Source, Canonical) on LoCoMo and LongMemEval-S benchmarks.
Switching targets changed nDCG on 83.4–94.0% of shared queries and flipped orderings on Mem0 and MemoryOS.
Only 29.2% of relaxed source-linked credit was fully justified in a 1,902-case semantic audit.
Parser-density recommendations reversed depending on chosen scoring target.

Why It Matters

Benchmark rankings of memory systems can silently flip — researchers must now report scoring targets explicitly.
