Research & Papers

A Mathematical Programming Approach to Computing and Learning Berk-Nash Equilibria in Infinite-Horizon MDPs

Researchers tackle AI's 'wrong assumptions' problem with a novel bilevel optimization and bandit learning scheme.

Deep Dive

Researchers Quanyan Zhu and Zhengye Han have published a significant paper tackling a core problem in AI decision-making: what happens when an agent's internal model of the world is fundamentally wrong or 'misspecified.' Within the Berk-Nash equilibrium framework for infinite-horizon Markov Decision Processes (MDPs), stable behavior emerges from a fixed point where the agent acts optimally based on its flawed subjective model, while that model remains statistically consistent with the long-run data its own policy generates. The authors provide a rigorous characterization of this equilibrium using coupled linear programs and a bilevel optimization formulation.
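
For orientation, and collapsing the agent's belief to a single model for readability, a Berk-Nash equilibrium couples an optimality condition with a statistical-consistency condition roughly as follows. This is a sketch of the standard formulation in our own notation, not necessarily the paper's exact statement:

```latex
% Coupled Berk-Nash conditions (sketch): the policy is optimal for the subjective
% model, while the model best explains the data that this policy itself generates.
\[
\pi \in \arg\max_{\pi'} \; \mathbb{E}^{\pi'}_{Q_{\theta^*}}\Big[ \sum_{t \ge 0} \gamma^{t} r(s_t, a_t) \Big],
\qquad
\theta^* \in \arg\min_{\theta \in \Theta} \; \mathbb{E}_{(s,a) \sim d^{\pi}}\Big[ \mathrm{KL}\big( Q(\cdot \mid s,a) \,\|\, Q_{\theta}(\cdot \mid s,a) \big) \Big]
\]
```

Here Q is the true transition kernel, Q_θ is the subjective kernel drawn from the conjecture set Θ, and d^π is the long-run state-action distribution that π induces in the true environment. The bilevel formulation nests one of these conditions inside the other, and the optimality condition is the piece that admits a standard linear-programming representation, which is presumably where the coupled linear programs enter.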

To overcome the non-smoothness of standard best-response functions, the team introduces entropy regularization. This technique establishes the existence of a unique 'soft Bellman' fixed point and yields a smooth objective function, making the problem computationally tractable. Leveraging this regularity, they develop an online learning algorithm that frames model selection as an adversarial bandit problem. It uses an EXP3-type update strategy, augmented by an innovative 'conjecture-set zooming' mechanism that dynamically refines the parameter search space based on performance. Numerical experiments show the approach balances exploration and exploitation, converges to the KL-divergence-minimizing model, and achieves sublinear regret, meaning its average per-round performance loss relative to an optimal benchmark vanishes over time. The work was accepted at the 15th EAI International Conference on Game Theory for Networks (GameNets 2026).
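
Concretely, the entropy-regularized value function behind this smoothing satisfies a soft Bellman equation of the following standard form, written in our own notation with temperature τ > 0 and taken under the agent's subjective model Q_θ (the paper's exact statement may differ):

```latex
% Soft (entropy-regularized) Bellman fixed point and softmax policy under the
% subjective model Q_theta, with temperature tau > 0.
\[
V_\tau(s) = \tau \log \sum_{a} \exp\Big( \tfrac{1}{\tau} \big( r(s,a) + \gamma\, \mathbb{E}_{s' \sim Q_\theta(\cdot \mid s,a)}[ V_\tau(s') ] \big) \Big),
\qquad
\pi_\tau(a \mid s) \propto \exp\Big( \tfrac{1}{\tau} \big( r(s,a) + \gamma\, \mathbb{E}_{s' \sim Q_\theta(\cdot \mid s,a)}[ V_\tau(s') ] \big) \Big)
\]
```

The log-sum-exp operator is a γ-contraction, which gives the unique fixed point, and the softmax policy is a smooth surrogate for the non-smooth argmax best response, which is what makes the outer model-selection objective smooth.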

Key Points
  • Addresses model misspecification in MDPs by characterizing Berk-Nash equilibria through coupled linear programs and a novel bilevel optimization formulation.
  • Introduces entropy regularization to create a unique 'soft Bellman' fixed point, enabling tractable computation.
  • Develops an online learning scheme with EXP3 bandit updates and adaptive 'conjecture-set zooming,' demonstrating sublinear regret (a minimal code sketch follows this list).
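
As a rough illustration of the bandit scheme in the last point above, here is a minimal EXP3-style model-selection loop over a finite, one-dimensional conjecture set. The `rollout_payoff` oracle (which would return a normalized payoff from acting under the selected model) and the periodic re-gridding step are hypothetical stand-ins; the paper's zooming mechanism and its regret guarantees are more refined than this toy restart:

```python
import numpy as np


def exp3_model_selection(thetas, rollout_payoff, T=2000, eta=0.05, mix=0.05,
                         zoom_every=500, seed=0):
    """EXP3-type selection over a finite conjecture set of candidate models.

    thetas         : iterable of candidate (scalar) model parameters
    rollout_payoff : callable theta -> payoff in [0, 1] from acting under theta
    mix            : uniform-exploration mixing weight
    zoom_every     : rounds between 'zooming' steps that re-grid the conjecture
                     set around the current leading candidate (toy heuristic)
    """
    rng = np.random.default_rng(seed)
    thetas = list(thetas)
    K = len(thetas)
    weights = np.ones(K)
    for t in range(1, T + 1):
        probs = (1 - mix) * weights / weights.sum() + mix / K
        i = rng.choice(K, p=probs)
        x = rollout_payoff(thetas[i])          # observed payoff of the chosen model
        x_hat = x / probs[i]                   # importance-weighted payoff estimate
        weights[i] *= np.exp(eta * x_hat / K)  # exponential-weights update
        if t % zoom_every == 0:                # toy conjecture-set zooming:
            best = thetas[int(np.argmax(weights))]
            width = 0.5 ** (t // zoom_every)   # shrink the search window
            thetas = list(np.linspace(best - width, best + width, K))
            weights = np.ones(K)               # restart weights on the refined grid
    return thetas[int(np.argmax(weights))]
```

A typical call would look like `exp3_model_selection(np.linspace(0.0, 1.0, 16), payoff_fn)`, where `payoff_fn` is an illustrative function that simulates an episode under the soft-optimal policy of the selected model and returns its normalized return.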

Why It Matters

Makes AI agents more robust in real-world scenarios where their initial assumptions are imperfect, improving reliability for finance, robotics, and strategic planning.