Research & Papers

TERMS-Bench: New framework diagnoses LLM negotiation failures beyond deal rate

Frontier LLMs saturate deal rates but still fail at surplus extraction and calibration.

Deep Dive

Zhang et al. introduce TERMS-Bench, a Bayesian-game framework that shifts LLM negotiation evaluation from aggregate rankings to actionable diagnosis. Across 13 frontier models, deal rates are saturated, yet the models diverge sharply in surplus extraction, cue use, belief calibration, and compliance. Because the counterpart's hidden state is observable to the evaluator, the counterpart itself serves as a diagnostic instrument, which allows failures to be attributed precisely.

Key Points
  • TERMS-Bench replaces opaque LLM-vs-LLM evaluation with a Bayesian-game framework where the counterpart's hidden state is observable to the evaluator.
  • 13 frontier models tested, including GPT-4, Claude 3, Gemini, and Llama 3; all reach high deal rates but show 30-50% gaps in surplus extraction and belief calibration.
  • Enables precise, agent-attributable failure analysis, identifying exactly where each model fails, from cue usage to compliance with binding constraints.
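
To make the gap between deal rate and the finer-grained measures concrete, here is a minimal Python sketch. It is not TERMS-Bench's code or its metric definitions: the `Episode` fields, the buyer-side setup, the surplus-share formula, and the use of a Brier score for belief calibration are all assumptions chosen for illustration, and the four episodes are made-up toy data. The only idea taken from the summary above is that the evaluator can see the counterpart's hidden state.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Episode:
    """One hypothetical buyer-side negotiation (illustrative fields, not the paper's schema)."""
    buyer_limit: float            # most the evaluated agent is allowed to pay
    seller_floor: float           # counterpart's hidden reservation price, visible to the evaluator
    deal_price: Optional[float]   # None when no agreement was reached
    belief_seller_is_low: float   # agent's stated probability that the seller has the low floor
    seller_is_low: bool           # true hidden type, visible to the evaluator


def deal_rate(episodes: List[Episode]) -> float:
    """Fraction of episodes that ended in an agreement."""
    return sum(e.deal_price is not None for e in episodes) / len(episodes)


def surplus_share(episodes: List[Episode]) -> float:
    """Mean fraction of the available surplus captured by the buyer, over closed deals."""
    shares = []
    for e in episodes:
        pie = e.buyer_limit - e.seller_floor  # total surplus on the table
        if e.deal_price is None or pie <= 0:
            continue  # skip failed deals and episodes with no zone of agreement
        shares.append((e.buyer_limit - e.deal_price) / pie)
    return sum(shares) / len(shares) if shares else 0.0


def brier_score(episodes: List[Episode]) -> float:
    """Mean squared error of the stated belief against the true hidden type (lower is better)."""
    errors = [(e.belief_seller_is_low - float(e.seller_is_low)) ** 2 for e in episodes]
    return sum(errors) / len(errors)


# Toy data: every negotiation closes, but the buyer leaves most of the surplus behind.
episodes = [
    Episode(100.0, 60.0, 90.0, 0.5, True),
    Episode(100.0, 80.0, 95.0, 0.5, False),
    Episode(100.0, 60.0, 95.0, 0.25, True),
    Episode(100.0, 80.0, 90.0, 0.75, False),
]

print("deal rate:    ", deal_rate(episodes))
print("surplus share:", surplus_share(episodes))
print("brier score:  ", brier_score(episodes))
```

In this toy run the agent closes every deal, yet captures well under half of the available surplus and reports beliefs no better than a coin flip. That is the kind of divergence an aggregate deal rate would hide.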

Why It Matters

Deal rate alone hides critical blind spots in AI negotiation skills, with consequences for procurement, labor deals, and autonomous market agents.