AI Safety

LLMs struggle with simplified cooperative poker logic

Frontier LLMs like GPT-4o and Claude 3.7 fail at token-based communication in a Texas Hold'em variant

Deep Dive

Researchers analyzed how large language models (LLMs) handle 'The Gang', a cooperative poker variant in which players communicate hidden hand rankings through token actions. In this simplified Texas Hold'em derivative with N players, players take turns claiming tokens numbered 0 to N-1, where token K indicates that the player's hand beats exactly K others. The sequence of token actions is the only communication channel, and the team wins only if every player ends up holding the correct token.
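The token rule can be made concrete with a minimal Python sketch. It assumes each hand can be reduced to a single comparable strength score and that there are no ties. The function names and the example scores are hypothetical illustrations, not taken from the study.

```python
from typing import Sequence


def correct_tokens(strengths: Sequence[int]) -> list[int]:
    """Return the token each player should hold.

    Token K means the player's hand beats exactly K other hands.
    Assumes all strength scores are distinct (no tied hands).
    """
    return [sum(other < mine for other in strengths) for mine in strengths]


def team_wins(strengths: Sequence[int], held_tokens: Sequence[int]) -> bool:
    """The team wins only if every player holds their correct token."""
    return list(held_tokens) == correct_tokens(strengths)


# Hypothetical strength scores for a four-player game.
strengths = [57, 12, 88, 30]
print(correct_tokens(strengths))           # [2, 0, 3, 1]
print(team_wins(strengths, [2, 0, 3, 1]))  # True
print(team_wins(strengths, [0, 1, 2, 3]))  # False
```

In the actual game no player sees the other hands, so nobody can compute this assignment directly. The players have to converge on it using only the sequence of token claims.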

The study found that local open-source models such as Qwen3.6 and Llama3.3:70b outperformed random chance, while frontier models such as GPT-4o and Claude 3.7 struggled with the game's logic. A deterministic solution exists that completes a four-player game in seven rounds, but the LLMs failed to replicate this reasoning. The research highlights the challenges LLMs face in multi-agent scenarios that require precise, rule-based communication.
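For a sense of scale, here is a rough sketch of one possible random-chance baseline. The summary does not say how the study defined its baseline, so this assumes the simplest case: the N tokens are dealt to the players as a uniformly random permutation, and hands are never tied. The function name is hypothetical.

```python
from fractions import Fraction
from itertools import permutations


def random_win_probability(n_players: int) -> Fraction:
    """Chance that a uniformly random token assignment is fully correct.

    Assumption: tokens are handed out as a random permutation, so exactly
    one of the N! possible assignments matches the true hand ranking.
    """
    target = tuple(range(n_players))  # stand-in for the correct assignment
    assignments = list(permutations(range(n_players)))
    wins = sum(assignment == target for assignment in assignments)
    return Fraction(wins, len(assignments))


print(random_win_probability(4))  # 1/24
```

Under that assumption, a four-player team that guesses blindly wins about 4% of the time.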

Key Points
  • Local open-source models such as Qwen3.6 and Llama3.3:70b outperformed random chance, while frontier models struggled with the game's logic
  • The Gang uses token actions (tokens numbered 0 to N-1) to communicate hand rankings in a simplified poker variant
  • A deterministic solution completes a four-player game in seven rounds, but LLMs failed to replicate it

Why It Matters

The results reveal LLMs' limitations in multi-agent systems that require precise, rule-based communication for strategic coordination.