BehaviorBench: New Benchmark Tests AI Personalization on Real User Decisions

1.6M real-world decision instances reveal when personalization actually helps AI models.

Deep Dive

Researchers led by Liangwei Yang introduced BehaviorBench, a benchmark that evaluates how well AI models personalize decisions using real-world behavioral traces. Existing benchmarks rely on simulated users or model-generated behavior, which often diverge from what people actually do. BehaviorBench instead reconstructs wallet-level decision histories from observed public prediction-market and on-chain records. It organizes these records into two complementary task layers: Belief prediction (a user's final stance and confidence) and Trade prediction (the direction and amount of individual transactions). The benchmark spans 2,000 evaluation wallets with 141,445 Belief instances and 1,485,972 Trade instances, roughly 1.6 million in total, and uses disjoint support pools for retrieval-based evaluation.
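To make the two task layers concrete, here is a minimal Python sketch of how a Belief instance and a Trade instance might be represented, plus a check of the headline totals. The class and field names (`BeliefInstance`, `TradeInstance`, `wallet_id`, `market_id`, and so on) are illustrative assumptions, not the benchmark's actual schema. Only the three counts come from the summary above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BeliefInstance:
    """Hypothetical record for the Belief layer: final stance plus confidence."""
    wallet_id: str
    market_id: str
    stance: str        # assumed label, e.g. "yes" or "no"
    confidence: float  # assumed to lie in [0, 1]


@dataclass(frozen=True)
class TradeInstance:
    """Hypothetical record for the Trade layer: one transaction."""
    wallet_id: str
    market_id: str
    direction: str  # assumed label, e.g. "buy" or "sell"
    amount: float   # assumed to be a non-negative size


# Counts reported in the article.
BELIEF_COUNT = 141_445
TRADE_COUNT = 1_485_972
EVAL_WALLETS = 2_000


def total_instances() -> int:
    """Sum of Belief and Trade instances (the '1.6M' in the headline)."""
    return BELIEF_COUNT + TRADE_COUNT


def mean_instances_per_wallet() -> float:
    """Average number of instances per evaluation wallet."""
    return total_instances() / EVAL_WALLETS
```

Under these numbers the two layers are heavily imbalanced: Trade instances outnumber Belief instances by roughly ten to one, which is worth keeping in mind when comparing results across layers.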

The researchers tested frontier and open-weight generative models under four history interfaces: no personalization, direct recent history, generated user profiles, and retrieved support-wallet evidence. Results show that personalization consistently improves Belief prediction more than Trade prediction, and that model rankings change across task layers and metrics. Different history interfaces also expose different failure modes. BehaviorBench thus offers an evaluation setting for studying whether personalization methods can make use of real-world behavioral evidence instead of relying solely on simulated users, a more realistic basis for decision-support AI systems.
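The four history interfaces can be pictured as four ways of filling the context a model sees before it predicts. The sketch below is an illustration under assumed names (`HistoryInterface`, `build_context`, `max_items`) and is not the authors' implementation. In particular, how profiles are generated and how support-wallet evidence is retrieved are left outside the function and passed in as plain strings.

```python
from enum import Enum
from typing import Sequence


class HistoryInterface(Enum):
    """The four conditions named in the article (identifiers are assumed)."""
    NONE = "no_personalization"
    RECENT = "direct_recent_history"
    PROFILE = "generated_user_profile"
    RETRIEVED = "retrieved_support_wallet_evidence"


def build_context(
    interface: HistoryInterface,
    recent_history: Sequence[str] = (),
    profile: str = "",
    retrieved: Sequence[str] = (),
    max_items: int = 5,
) -> str:
    """Return the personalization text to place before the prediction prompt."""
    if max_items < 1:
        raise ValueError("max_items must be at least 1")
    if interface is HistoryInterface.NONE:
        # Baseline: the model sees no user-specific evidence.
        return ""
    if interface is HistoryInterface.RECENT:
        # Keep only the most recent events, oldest first.
        return "\n".join(recent_history[-max_items:])
    if interface is HistoryInterface.PROFILE:
        # A profile is assumed to be produced elsewhere (e.g. by a model).
        return profile
    if interface is HistoryInterface.RETRIEVED:
        # Evidence is assumed to be pre-ranked, best match first.
        return "\n".join(retrieved[:max_items])
    raise ValueError(f"unknown interface: {interface}")
```

For example, `build_context(HistoryInterface.RECENT, recent_history=["t1", "t2", "t3"], max_items=2)` keeps only the last two events. Holding the prompt fixed and swapping the interface is one simple way to compare the conditions on equal footing.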

Key Points
  • Benchmark uses real prediction-market and on-chain records rather than simulated users, so the behavior being predicted is what people actually did.
  • Contains 141,445 Belief instances and 1,485,972 Trade instances across 2,000 wallets for large-scale evaluation.
  • Personalization improves Belief prediction more than Trade prediction, and model rankings shift depending on the task layer and metric.

Why It Matters

Moves AI personalization evaluation from synthetic simulations to real-world behavior, enabling more reliable decision-support systems.