LifeBench: A Benchmark for Long-Horizon Multi-Source Memory
New benchmark pushes AI beyond simple recall, requiring integration of multiple memory types across extended timelines.
A research team led by Zihao Cheng, with 17 co-authors, has introduced LifeBench, a benchmark designed to evaluate AI agents' ability to handle long-horizon, multi-source memory tasks. Unlike existing benchmarks, which focus primarily on declarative memory (semantic and episodic information explicitly presented in dialogues), LifeBench targets the gap in non-declarative memory, including habitual and procedural knowledge that must be inferred from diverse digital traces. The benchmark features densely connected, long-horizon event simulations that demand more than simple recall: agents must integrate different memory types across extended temporal contexts. This is a significant step toward personalized agents that accumulate knowledge, reason over user experiences, and adapt their behavior over time.
To ensure data quality and scalability, the researchers grounded generation in real-world priors, including anonymized social surveys, map APIs, and holiday-integrated calendars, enforcing fidelity, diversity, and behavioral rationality. Drawing on cognitive science, they structured events according to a partonomic (part-whole) hierarchy, enabling efficient parallel generation while maintaining global coherence. The results expose the current limits of AI memory systems: top-tier models achieve just 55.2% accuracy on LifeBench tasks, demonstrating the inherent difficulty of long-horizon retrieval and multi-source integration. With the dataset and synthesis code publicly available, the benchmark provides a testing ground for developing agents that can navigate complex, real-world scenarios requiring sustained memory and reasoning.
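The idea of a partonomic hierarchy enabling parallel generation can be sketched in a toy form. This is an illustrative sketch only, not the authors' pipeline: the event taxonomy, the `expand_event` helper, and the `USER_PROFILE` prior are all hypothetical stand-ins (a real system would replace the lookup table with a generator model conditioned on the shared priors).

```python
# Hypothetical sketch: expanding a partonomic (part-whole) event hierarchy
# in parallel. Sibling subtrees are independent given shared global context,
# so they can be generated concurrently without losing coherence.
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field

# Hypothetical global prior shared by every subtree for coherence.
USER_PROFILE = {"home_city": "Shanghai", "commute": "metro"}

@dataclass
class Event:
    name: str
    parts: list = field(default_factory=list)  # partonomic children

def expand_event(name: str) -> Event:
    # Toy decomposition table standing in for a generator/LLM call.
    PARTONOMY = {
        "morning_routine": ["wake_up", "breakfast", "commute"],
        "workday": ["standup", "deep_work", "lunch"],
    }
    children = [Event(part) for part in PARTONOMY.get(name, [])]
    return Event(name, children)

def generate_day(top_level_events):
    # Each top-level event's subtree is expanded in parallel.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(expand_event, top_level_events))

day = generate_day(["morning_routine", "workday"])
```

Because each subtree only reads the shared profile and never mutates it, the parallel expansion is deterministic and order-independent, which is the property that makes hierarchical generation scale.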
- Benchmark tests AI on long-horizon multi-source memory, integrating declarative and non-declarative types
- Uses real-world data sources including social surveys, map APIs, and calendars for behavioral realism
- State-of-the-art memory systems achieve only 55.2% accuracy, highlighting a significant performance gap
Why It Matters
Enables development of AI agents that can truly learn from experience and adapt over time, moving beyond simple pattern recognition.