Pessimism-Free Offline Learning in General-Sum Games via KL Regularization
New GANE algorithm recovers Nash equilibria at Õ(1/n) rate without manual penalties
Offline multi-agent reinforcement learning struggles with distribution shift: the logged dataset can differ sharply from the target equilibrium policies. Standard approaches counter this shift with manually designed pessimistic penalties, which can be brittle and hard to tune. In a new paper, researchers Claire Chen and Yuheng Zhang demonstrate that KL (Kullback–Leibler) regularization alone is sufficient to stabilize learning and recover equilibrium policies without explicit pessimism. They introduce GANE (General-sum Anchored Nash Equilibrium), which recovers regularized Nash equilibria at an accelerated statistical rate of Õ(1/n) in the number of logged samples n, significantly faster than the standard Õ(1/√n) dependence of existing methods. For computationally efficient deployment, they also propose GAMD (General-sum Anchored Mirror Descent), an iterative algorithm that converges to a Coarse Correlated Equilibrium at the standard rate of Õ(1/√n + 1/T) over T iterations. The key insight is that KL regularization naturally provides the necessary stability, eliminating the need for handcrafted pessimistic terms.
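To make the mechanism concrete, here is a minimal sketch (not the authors' code) of a KL-anchored mirror descent update of the kind GAMD's name suggests, run on a toy two-player general-sum matrix game. The payoff matrices, the uniform anchor policies mu1 and mu2 (standing in for a behavior policy estimated from the logged data), and the hyperparameters tau, eta, and T are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 4
A = rng.uniform(size=(k, k))  # payoffs for player 1 (row player), assumed for illustration
B = rng.uniform(size=(k, k))  # payoffs for player 2 (column player), assumed for illustration

# Uniform anchors: a stand-in for the behavior policy one would estimate
# from the offline dataset in the paper's setting.
mu1 = np.full(k, 1.0 / k)
mu2 = np.full(k, 1.0 / k)

tau, eta, T = 0.1, 0.5, 5000  # KL weight, step size, iterations (illustrative values)

pi1, pi2 = mu1.copy(), mu2.copy()
avg1, avg2 = np.zeros(k), np.zeros(k)

for _ in range(T):
    q1 = A @ pi2    # expected payoff of each action for player 1
    q2 = B.T @ pi1  # expected payoff of each action for player 2
    # Closed-form solution of the KL-anchored proximal step:
    #   argmax_pi  eta*<q, pi> - eta*tau*KL(pi || mu) - KL(pi || pi_t)
    pi1 = (pi1 * mu1 ** (eta * tau) * np.exp(eta * q1)) ** (1.0 / (1.0 + eta * tau))
    pi1 /= pi1.sum()
    pi2 = (pi2 * mu2 ** (eta * tau) * np.exp(eta * q2)) ** (1.0 / (1.0 + eta * tau))
    pi2 /= pi2.sum()
    avg1 += pi1
    avg2 += pi2

avg1 /= T
avg2 /= T  # time-averaged play
print("averaged policies:", avg1, avg2)
```

Averaging the iterates is the standard no-regret route to a coarse correlated equilibrium, the solution concept GAMD targets, while the anchor term keeps every iterate close in KL to the data-generating policy. That closeness is the stabilizing role the paper attributes to KL regularization in place of explicit pessimistic penalties.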
This work has practical implications for any setting where multiple AI agents must learn from static datasets — from autonomous driving coordination to bidding in ad auctions. By proving that KL regularization is a standalone mechanism for pessimism-free offline learning, the authors simplify algorithm design and reduce hyperparameter tuning. The accelerated Nash equilibrium recovery rate means agents can reach stable, mutually optimal strategies with less data. GAMD’s tractable convergence also makes it deployable in real-world multi-agent systems where computing exact equilibria is intractable. The paper is available on arXiv (2605.00264) and has been submitted to ICML 2026.
- GANE recovers Nash equilibria at an accelerated Õ(1/n) rate, bypassing traditional pessimistic penalties
- GAMD converges to a Coarse Correlated Equilibrium at Õ(1/√n + 1/T) with practical iterative updates
- KL regularization alone suffices to handle distribution shift in general-sum offline multi-agent games
Why It Matters
Simpler, faster offline multi-agent RL without manual tuning — a step toward robust autonomous multi-agent systems.