When Reasoning Models Hurt Behavioral Simulation: A Solver-Sampler Mismatch in Multi-Agent LLM Negotiation
GPT-5.2's superior reasoning caused it to fail at simulating realistic human compromise in negotiations.
A new research paper by Sandro Andric challenges a core assumption in AI-driven social science: that more reasoning capability always yields better simulations of human behavior. The study, titled 'When Reasoning Models Hurt Behavioral Simulation,' demonstrates that in multi-agent negotiation scenarios such as fragmented-authority trading and emergency electricity management, models optimized as 'solvers' can fail as 'samplers' of plausible human action. Advanced models like GPT-5.2, when using their full 'native reasoning,' over-optimize for strategically dominant outcomes, collapsing the diversity and compromise seen in real-world interactions.
In a direct test using OpenAI's models, GPT-5.2 with native reasoning produced authority-imposed decisions in all 45 simulation runs across three different environments. In stark contrast, the same GPT-5.2 model with 'bounded reflection' (a constrained form of reasoning) recovered compromise outcomes in every case. This reveals a 'solver-sampler mismatch': the very capability that makes an LLM powerful at problem-solving makes it worse at simulating the bounded rationality and suboptimal compromises of real human negotiators.
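To make the contrast concrete, here is a minimal sketch of how such a comparison could be tallied. It is not the paper's actual harness: the `run_negotiation` stub, the environment names, and the even 15-runs-per-environment split are illustrative assumptions; only the 45-run total, the two reasoning modes, and the reported outcome pattern come from the study.

```python
import random
from collections import Counter

# Environment names are illustrative; the article names only two of the three.
ENVIRONMENTS = [
    "fragmented_authority_trading",
    "emergency_electricity_management",
    "unnamed_third_environment",
]
RUNS_PER_ENV = 15  # assumed even split of the 45 total runs

def run_negotiation(env: str, reasoning_mode: str) -> str:
    """Hypothetical stand-in for one multi-agent negotiation episode.

    A real harness would drive LLM agents through the environment and
    classify the final outcome. This stub simply encodes the pattern the
    paper reports: native reasoning converges on authority-imposed
    outcomes, while bounded reflection recovers compromise.
    """
    if reasoning_mode == "native":
        return "authority_imposed"
    if reasoning_mode == "bounded_reflection":
        return "compromise"
    # No-reasoning baseline: mixed but not human-calibrated outcomes.
    return random.choice(["authority_imposed", "compromise", "stalemate"])

for mode in ["native", "bounded_reflection"]:
    outcomes = Counter(
        run_negotiation(env, mode)
        for env in ENVIRONMENTS
        for _ in range(RUNS_PER_ENV)
    )
    print(mode, dict(outcomes))
```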
The paper's contribution is a crucial methodological warning for economists, policy designers, and anyone using LLMs for simulation: behavioral simulation requires qualifying models specifically as 'samplers,' not just evaluating them as 'solvers' on standard benchmarks. The best model for simulation fidelity may not be the most capable one, but the one whose constrained cognitive processing best mimics human behavior.
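What would qualifying a model as a 'sampler' look like in practice? The paper's exact criteria aren't given here, but one plausible check, sketched below under our own assumptions, is to compare the model's empirical outcome distribution against a human reference distribution (total variation distance) and to measure its diversity (Shannon entropy); a distribution that collapses onto a single outcome scores zero entropy, flagging solver-like behavior.

```python
import math
from collections import Counter

def outcome_distribution(outcomes: list[str]) -> dict[str, float]:
    """Empirical distribution over outcome categories."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def total_variation(p: dict[str, float], q: dict[str, float]) -> float:
    """Total variation distance between two outcome distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def shannon_entropy(p: dict[str, float]) -> float:
    """Diversity of the outcome distribution, in bits."""
    return -sum(v * math.log2(v) for v in p.values() if v > 0)

# The human baseline numbers are illustrative, not from the paper; the
# degenerate all-authority distribution matches the reported native-reasoning
# result (45 of 45 runs).
human = {"compromise": 0.7, "authority_imposed": 0.2, "stalemate": 0.1}
native = outcome_distribution(["authority_imposed"] * 45)

print(f"TV distance to human baseline: {total_variation(native, human):.2f}")
print(f"Outcome entropy (bits): {shannon_entropy(native):.2f}")  # 0.0 = collapsed
```

On the all-authority distribution the article describes, entropy is 0.0 bits and the distance to any mixed human baseline is large, which is exactly the collapse a sampler qualification would catch before such a model is trusted in a simulation.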
- GPT-5.2 with full 'native reasoning' failed to simulate compromise, producing authority-imposed decisions in 45 out of 45 negotiation runs.
- A 'bounded reflection' setting, which constrains reasoning, produced significantly more diverse and realistic behavioral trajectories than either no reasoning or full reasoning.
- The research identifies a 'solver-sampler mismatch,' showing that a model's problem-solving capability (being a good 'solver') is distinct from its ability to simulate human behavior (being a good 'sampler').
Why It Matters
This forces a rethink of how AI is used for policy and economic simulations, where realistic human behavior, not optimal solutions, is the goal.