Research & Papers

SocialReasoning-Bench reveals AI agents fail to negotiate in users' best interests

Frontier models settle for bad deals and suboptimal meeting times, accepting the first marketplace proposal up to 93% of the time.

Deep Dive

A new benchmark, SocialReasoning-Bench, tests whether AI agents can negotiate on behalf of users in calendar and marketplace settings. Results show that frontier models complete the tasks but often accept suboptimal outcomes, leaving value on the table. Even when explicitly prompted to act in the user's best interest, they perform well below what a trustworthy delegate should achieve. The benchmark measures both outcome optimality (the value secured for the user) and due diligence (whether the agent follows a competent decision-making process), highlighting critical gaps in social reasoning as agents like Claude Cowork and Gemini take on real-world tasks.
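To make the two measures concrete, here is a minimal Python sketch of how they might be scored for a single marketplace episode. The summary does not give the benchmark's actual formulas, so everything here is an assumption for illustration: the `NegotiationResult` fields, the normalization of savings between the seller's opening ask and floor price, and the checklist-style due-diligence score.

```python
from dataclasses import dataclass


@dataclass
class NegotiationResult:
    """One finished buyer-side episode. All fields are hypothetical."""
    opening_ask: float    # seller's first proposal
    seller_floor: float   # lowest price the seller would have accepted
    agreed_price: float   # price the user's agent finally accepted
    checks_passed: int    # diligence steps the agent actually performed
    checks_total: int     # diligence steps a careful delegate would perform


def outcome_optimality(r: NegotiationResult) -> float:
    """Share of the available savings the agent captured, clamped to [0, 1]."""
    available = r.opening_ask - r.seller_floor
    if available <= 0:
        return 1.0  # nothing could be gained by negotiating
    captured = r.opening_ask - r.agreed_price
    return max(0.0, min(1.0, captured / available))


def due_diligence(r: NegotiationResult) -> float:
    """Fraction of expected diligence steps the agent carried out."""
    if r.checks_total == 0:
        return 1.0
    return r.checks_passed / r.checks_total


# An agent that takes the opening ask completes the task but captures no value.
took_first_offer = NegotiationResult(100.0, 70.0, 100.0, 1, 4)
# An agent that counters once and meets the seller halfway.
countered_once = NegotiationResult(100.0, 70.0, 85.0, 3, 4)
```

Under these assumptions, the first agent scores 0 on outcome optimality despite closing the deal, which is the gap between completing a task and securing a good outcome that the benchmark is described as exposing.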

Key Points
  • SocialReasoning-Bench tests agents in two domains: Calendar Coordination (scheduling) and Marketplace Negotiation (pricing).
  • In simulated marketplace tests, agents accepted the first proposal up to 93% of the time without exploring alternatives.
  • Even when explicitly prompted to act in the user's best interest, agents perform below what a trustworthy delegate should achieve.
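
A statistic like the first-proposal acceptance rate could be computed from episode logs roughly as follows. This is a sketch only: the log format (one list of agent actions per episode, with the strings "accept" and "counter") is hypothetical and not taken from the benchmark.

```python
def first_proposal_acceptance_rate(episodes: list[list[str]]) -> float:
    """Fraction of episodes in which the agent's first action was 'accept'."""
    if not episodes:
        raise ValueError("need at least one episode")
    accepted_first = sum(
        1 for actions in episodes if actions and actions[0] == "accept"
    )
    return accepted_first / len(episodes)


# Hypothetical logs: three agents accept immediately, one counters first.
sample_logs = [
    ["accept"],
    ["counter", "accept"],
    ["accept"],
    ["accept"],
]
rate = first_proposal_acceptance_rate(sample_logs)
```

Here "without exploring alternatives" is read as the agent accepting before making any counteroffer; a real evaluation might also count clarifying questions or information requests as exploration.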

Why It Matters

As AI agents take over managing calendars and negotiating purchases, weak social reasoning puts users at risk of poor outcomes and privacy breaches.