Research & Papers

Self-play RL collapses below a critical decision capacity threshold

Eliminating all contingent decisions triggers deterministic exploitation attractor at near-max loss

Deep Dive

A new paper from Arahan Kujur, posted on arXiv (2605.16315), identifies a critical threshold in decision capacity that determines whether self-play reinforcement learning (RL) agents collapse under asymmetric rule perturbations. The core finding: when all “positive-reach contingent decisions” (decisions that are both reachable in the game tree and depend on past actions) are removed, the agents rapidly converge to a deterministic exploitation attractor, a fixed point where they suffer near-maximal loss. Preserving even a single such decision point prevents this collapse entirely. The study covers poker variants, matrix games, and a dice game, trained with multiple RL algorithms (including Q-learning and policy gradient), and the threshold appears to hold across all of them.

The mechanism behind the collapse is co-adaptation under constraint, not the perturbation itself. Control experiments with frozen baselines and fixed opponents confirmed that the agents’ mutual adaptation in a constrained decision space is what drives them to a pathological fixed point. The phenomenon is timing-invariant (it doesn't matter when decisions are removed) and fully reversible: restoring just one contingent decision immediately restores stable learning. Function approximation (e.g., neural networks) makes the collapse more severe. Together, these results establish a sharp threshold at zero reach-weighted contingent action capacity, with severity scaling continuously with reach-weighted capacity. The paper has immediate implications for training stable multi-agent RL systems, especially in game-theoretic and adversarial settings where self-play is common.
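To make the idea of a zero-capacity threshold concrete, here is a minimal Python sketch. It is an illustration only, not the paper's definition: it assumes that reach-weighted contingent action capacity can be approximated as the sum of reach probabilities over decision points that are reachable, contingent on earlier play, and still offer more than one action. The names DecisionPoint, reach_weighted_capacity, and below_threshold, as well as the toy decision points and reach values, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecisionPoint:
    """Hypothetical description of one decision point in a small game."""
    name: str
    reach: float       # assumed probability that play reaches this point
    num_actions: int   # actions still available after the rule perturbation
    contingent: bool   # True if the choice can depend on earlier play


def reach_weighted_capacity(points):
    """Assumed measure: total reach of points that still offer a real choice."""
    return sum(
        p.reach
        for p in points
        if p.contingent and p.reach > 0.0 and p.num_actions > 1
    )


def below_threshold(points):
    """True when no positive-reach contingent decision remains."""
    return reach_weighted_capacity(points) == 0.0


# Toy game with two contingent decision points (values are made up).
full_game = [
    DecisionPoint("bet_or_check", reach=1.0, num_actions=2, contingent=True),
    DecisionPoint("call_or_fold", reach=0.5, num_actions=2, contingent=True),
]

# Perturbation that forces a single action everywhere.
all_removed = [
    DecisionPoint("bet_or_check", reach=1.0, num_actions=1, contingent=True),
    DecisionPoint("call_or_fold", reach=0.5, num_actions=1, contingent=True),
]

# Same perturbation, but one contingent decision is preserved.
one_kept = [
    DecisionPoint("bet_or_check", reach=1.0, num_actions=1, contingent=True),
    DecisionPoint("call_or_fold", reach=0.5, num_actions=2, contingent=True),
]

for label, game in [("full", full_game), ("all removed", all_removed), ("one kept", one_kept)]:
    print(label, reach_weighted_capacity(game), below_threshold(game))
```

Under these assumptions, only the "all removed" configuration has zero capacity, which is the regime the summary associates with collapse. The "one kept" configuration stays above the threshold with a smaller capacity than the full game. The sketch only computes the measure; it does not simulate the self-play dynamics or reproduce any result from the paper.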

Key Points
  • Removing ALL positive-reach contingent decisions causes rapid collapse to a deterministic exploitation attractor with near-maximal loss across poker, matrix, and dice games.
  • Preserving just ONE such decision point prevents collapse entirely, establishing a sharp zero-capacity threshold.
  • Collapse is reversible upon action restoration, worsens with function approximation, and is timing-invariant and domain-independent.

Why It Matters

If the threshold holds beyond the tested games, it offers a concrete design rule for stable self-play training of competitive AI agents: keep at least one positive-reach contingent decision in the game.