AlpsBench: An LLM Personalization Benchmark for Real-Dialogue Memorization and Preference Alignment
Researchers unveil 2,500 real dialogue sequences showing that AI assistants fail to remember user preferences.
A team of researchers led by Jianfei Xiao has introduced AlpsBench, a benchmark designed to rigorously evaluate how well Large Language Models (LLMs) like GPT-4 and Claude can be personalized for individual users. Unlike previous benchmarks that relied on synthetic dialogues, AlpsBench is built from 2,500 real-world, long-term interaction sequences sourced from the WildChat dataset. Each sequence is paired with human-verified structured memories that capture both explicit facts and implicit user preferences, creating a realistic testbed for AI assistants that aim to learn and adapt over time.
The benchmark defines four critical tasks that span the entire lifecycle of memory management: extracting personal information from conversations, updating that information as new details emerge, retrieving the correct memory when needed, and utilizing it to generate preference-aligned responses. Initial evaluations of frontier models and memory systems on AlpsBench have exposed significant shortcomings. Models show a surprising inability to reliably infer latent user traits from dialogue. Even the strongest systems hit a performance ceiling when updating memories, and retrieval accuracy drops sharply when models must sift through large pools of distracting, irrelevant information.
Perhaps most importantly, the findings challenge a common assumption: simply giving a model an explicit memory mechanism does not guarantee it will produce responses that are more emotionally resonant or better aligned with a user's deeper preferences. This highlights the complex gap between factual recall and true personal understanding. AlpsBench aims to move the field beyond simple performance metrics and provide a comprehensive, realistic framework for developing AI that can function as a genuine lifelong companion, making personalization research more measurable and directly applicable to real-world products.
- Built from 2,500 real human-LLM dialogue sequences from WildChat, avoiding the distribution gap of synthetic data.
- Tests four core personalization tasks against human-verified memories: information extraction, updating, retrieval, and utilization.
- Reveals that models struggle to infer latent traits, memory updating hits a performance ceiling, and retrieval accuracy degrades with many distractors.
Why It Matters
Provides the first gold-standard test for building AI assistants that truly remember and adapt to individual users over time.