ConvApparel: A Benchmark Dataset and Validation Framework for User Simulators in Conversational Recommenders
New benchmark reveals LLM-based user simulators fail to match real human interactions, risking flawed AI assistants.
Researchers from Google, Technion, and the University of Washington introduced ConvApparel, a benchmark dataset and validation framework for conversational AI recommenders. It contains human-AI conversations collected with both deliberately 'good' and 'bad' recommenders, enriched with user satisfaction annotations. Their framework combines statistical alignment, human-likeness scores, and counterfactual validation. Experiments reveal a significant 'realism gap' across all tested simulators, though data-driven models outperform prompted baselines at adapting to unseen user behaviors.
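To give a flavor of what a statistical-alignment check might look like, here is a minimal sketch that compares a simulated and a real distribution of a turn-level behavior metric with a two-sample Kolmogorov-Smirnov statistic. This is an illustrative stand-in, not ConvApparel's actual methodology: the metric (words per user turn), the data, and the use of KS as the alignment measure are all assumptions.

```python
import bisect

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the maximum gap between empirical CDFs.

    A large value means the two behavior distributions diverge, i.e. the
    simulator's users do not 'look like' the real ones on this metric.
    """
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in sorted(set(a) | set(b)):
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

# Illustrative (fabricated) samples of words-per-user-turn: real humans
# tend to write shorter turns than an overly verbose LLM simulator.
human_turns = [4, 6, 7, 8, 9, 11, 12, 15]
simulated_turns = [10, 12, 14, 15, 16, 18, 20, 22]

gap = ks_statistic(human_turns, simulated_turns)
print(f"KS statistic (alignment gap proxy): {gap:.3f}")  # prints 0.625
```

In a real validation pipeline, a check like this would be run over many behavioral metrics (turn length, acceptance rate, query reformulations), with the human-likeness and counterfactual components covering what aggregate statistics miss.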
Why It Matters
This provides a crucial tool for building AI shopping assistants and customer service bots that perform reliably with real people, not just in simulation.