MemConflict framework reveals flaws in AI long-term memory retrieval
New benchmark exposes how LLMs fail when memory conflicts with facts or time...
A new paper introduces MemConflict, a diagnostic framework for evaluating long-term memory systems in LLM-based conversational agents. Existing evaluations focus on outcome-level accuracy or temporal updating, but they miss how systems retrieve and rank memory evidence when conflicting alternatives exist. MemConflict formalizes three conflict types: dynamic (information that changes over time), static (factual contradictions), and conditional (context-dependent mismatches). It simulates controlled multi-session histories from structured user profiles, injecting cross-session conflicts and semantically similar distractors so that memory candidates compete with one another.
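To make the idea of a controlled history concrete, here is a minimal sketch of one dynamic conflict: an attribute is stated in an early session, distractors fill the sessions in between, and an updated value arrives later. Everything in it is an assumption for illustration (the `Memory` and `ConflictType` types, the role labels, the `build_dynamic_history` helper and its `gap` parameter); it is not the paper's generator.

```python
from dataclasses import dataclass
from enum import Enum


class ConflictType(Enum):
    DYNAMIC = "dynamic"          # a value that changes over time
    STATIC = "static"            # statements that contradict each other
    CONDITIONAL = "conditional"  # a value that depends on context


@dataclass(frozen=True)
class Memory:
    session: int  # index of the session in which the statement was made
    text: str
    role: str     # hypothetical labels: "support", "conflict", "distractor"


def build_dynamic_history(attribute, old_value, new_value, distractors, gap):
    """Build a toy history in which `attribute` changes after `gap` sessions.

    The outdated statement sits in session 0, the current one in session
    `gap`, and distractors are spread over the sessions in between.
    """
    history = [Memory(0, f"My {attribute} is {old_value}.", "conflict")]
    for i, text in enumerate(distractors):
        # Cycle distractors through sessions 1..gap-1 (session 1 if gap <= 2).
        session = 1 + i % max(gap - 1, 1)
        history.append(Memory(session, text, "distractor"))
    history.append(Memory(gap, f"My {attribute} is now {new_value}.", "support"))
    return sorted(history, key=lambda m: m.session)


history = build_dynamic_history(
    "home city",
    "Boston",
    "Denver",
    distractors=["My sister lives in Boston.", "I visited Denver last year."],
    gap=3,
)
for memory in history:
    print(memory.session, memory.role, memory.text)
```

In this toy setup, raising `gap` or lengthening `distractors` loosely corresponds to the larger conflict distances and heavier distractor loads that the paper's sensitivity analyses vary.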
The framework supports both black-box evaluation (is the final answer correct?) and white-box evaluation (were the supporting memories retrieved, and how were they ranked?). Tests on six representative long-term memory systems reveal uneven strengths across conflict types. Notably, answer correctness often diverges from the quality of memory retrieval and ranking. Sensitivity analyses show that longer histories, more distractors, implicit queries, and larger conflict distances all degrade performance. Diagnostics trace failures to two sources: supporting memories that are missing from retrieval, and retrieved memories that are used ineffectively. The authors position MemConflict as a step toward principled memory governance, with reliability assessment that accounts for both retrieval and conflict.
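The gap between the two evaluation views can be illustrated with a short sketch. The metrics below (normalized exact match, recall at k, reciprocal rank) are common retrieval measures chosen here as assumptions; the summary does not say which metrics the paper uses, and the memory IDs are hypothetical.

```python
def answer_correct(predicted, gold):
    """Black-box check: case-insensitive exact match on the final answer."""
    return predicted.strip().lower() == gold.strip().lower()


def recall_at_k(ranked_ids, support_ids, k):
    """White-box check: fraction of supporting memories found in the top k."""
    support = set(support_ids)
    if not support:
        return 0.0
    return len(set(ranked_ids[:k]) & support) / len(support)


def reciprocal_rank(ranked_ids, support_ids):
    """White-box check: 1 / rank of the first supporting memory, else 0.0."""
    support = set(support_ids)
    for rank, memory_id in enumerate(ranked_ids, start=1):
        if memory_id in support:
            return 1.0 / rank
    return 0.0


# Hypothetical case: the system answers correctly, yet its retriever ranks
# the outdated memory and a distractor above the supporting memory.
ranked = ["m_old", "m_distractor", "m_new"]
support = ["m_new"]

black_box = answer_correct("Denver", "denver ")
white_box_recall = recall_at_k(ranked, support, k=1)
white_box_rr = reciprocal_rank(ranked, support)
print(black_box, white_box_recall, round(white_box_rr, 3))
```

Here the answer passes the black-box check while recall at 1 is zero, which is the kind of divergence an outcome-only evaluation would not surface.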
- MemConflict tests three conflict types (dynamic, static, and conditional) in multi-session recall
- Six long-term memory systems evaluated; answer correctness often diverges from retrieval and ranking quality
- Longer histories, more distractors, implicit queries, and larger conflict distances all degrade performance
Why It Matters
As AI agents gain persistent memory, MemConflict offers a rigorous benchmark for checking whether they handle conflicting information reliably.