Research & Papers

What Suppresses Nash Equilibrium Play in Large Language Models? Mechanistic Evidence and Causal Control

Researchers open the black box on Llama-3 and find that cooperation overrides rational play.

Deep Dive

Researchers Paraskevas Lekeas and Giorgos Stamatopoulos have published a mechanistic analysis of why large language models fail to play Nash equilibrium strategies in game-theoretic settings. Testing four open-source models (Llama-3 and Qwen2.5, ranging from 8B to 72B parameters) across four canonical two-player games, they first established behavioral patterns through self-play and cross-play experiments. They then opened the 32-layer Llama-3-8B model to trace what happens internally during strategic decisions.
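
The internal tracing described below rests on a standard interpretability tool: linear probes that read information out of the residual stream, one layer at a time. As a rough illustration of the technique (not the authors' code), the sketch below fits a logistic-regression probe on cached layer activations; the random arrays are stand-ins for real activations and opponent-move labels, which would be collected via forward hooks like the one in the later example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-ins for real data: in practice, X holds residual-stream activations
# cached at one layer (one row per game turn) and y holds the opponent's
# last move (0 = defect, 1 = cooperate). Shapes are illustrative.
d_model, n_samples = 4096, 2000          # 4096 is Llama-3-8B's hidden size
rng = np.random.default_rng(0)
X = rng.standard_normal((n_samples, d_model))
y = rng.integers(0, 2, n_samples)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"probe accuracy: {probe.score(X_te, y_te):.1%}")

# The fitted weight vector doubles as a candidate concept direction,
# the kind of learned "Nash direction" used for clamping further below.
direction = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
```

Probe accuracy at each layer indicates how linearly recoverable a concept is at that depth, which is what the per-layer figures in the next paragraph measure.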

The mechanistic findings are striking: opponent move history is encoded with near-perfect fidelity (96% probe accuracy) from the very first layer and consumed progressively by later layers. Nash action encoding, by contrast, remains weak throughout the model, never exceeding 56% accuracy. The model actually computes the Nash action and favors it through most of its forward pass, but a prosocial override concentrated in the final layers reverses this, driving cooperation probability to 84% at layer 30. By injecting a learned Nash direction into the residual stream via concept clamping, the researchers could shift behavior bidirectionally, demonstrating that the model possesses Nash-playing competence but actively suppresses it.
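
Concept clamping here is a form of activation steering: adding a scaled concept direction to the residual stream during the forward pass. The sketch below shows how such an intervention can be wired up with a forward hook on a Llama-style model; the layer index, the scale ALPHA, and the random nash_dir vector are illustrative placeholders (the paper uses a learned direction), not the authors' actual values or code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Meta-Llama-3-8B"  # the 32-layer model analyzed in the paper
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)

d_model = model.config.hidden_size
nash_dir = torch.randn(d_model)           # placeholder: use a learned direction
nash_dir = nash_dir / nash_dir.norm()

LAYER, ALPHA = 24, 8.0                    # illustrative injection point and strength

def clamp_hook(module, inputs, output):
    # Llama decoder layers return a tuple whose first element is the
    # residual-stream hidden states; add the scaled direction there.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + ALPHA * nash_dir.to(hidden.device, hidden.dtype)
    if isinstance(output, tuple):
        return (hidden,) + output[1:]
    return hidden

handle = model.model.layers[LAYER].register_forward_hook(clamp_hook)
prompt = "You are playing an iterated Prisoner's Dilemma. Your next move:"
out = model.generate(**tok(prompt, return_tensors="pt"), max_new_tokens=5)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()  # detach the hook to restore unsteered behavior
```

Flipping the sign of ALPHA steers in the opposite direction, which is the bidirectional control the researchers use to argue that the competence is present but suppressed.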

The behavioral experiments also uncovered scale- and architecture-dependent results. Chain-of-thought reasoning worsens Nash play in small models but yields near-perfect play above 70B parameters. Cross-play experiments reveal three phenomena invisible in self-play: a small model can unravel any partner's cooperation by defecting early; two large models reinforce each other's cooperative instincts indefinitely; and first-mover advantage in coordination games determines which Nash equilibrium is reached.

Key Points
  • Opponent history encoded with 96% accuracy at layer 1, but Nash action encoding never exceeds 56%.
  • Prosocial override in final layers pushes cooperation to 84% by layer 30, suppressing Nash play.
  • Chain-of-thought reasoning worsens Nash play in models under 70B but yields near-perfect play above 70B parameters.

Why It Matters

Mechanistic understanding of LLM strategic behavior enables better control for multi-agent AI systems and game-theoretic applications.