Continuous-time q-learning for mean-field control with common noise, Part I: Theoretical foundations
Researchers prove value-function convergence under time discretization, establish a well-posed policy improvement step, and define an integrated q-function for multi-agent settings.
This paper, the first of a two-part series, tackles continuous-time q-learning for mean-field control (MFC) when a common noise source affects all agents simultaneously, a realistic scenario for applications like smart grids, robotic swarms, and financial markets. The authors first bridge the gap between the exploratory formulation (with discretely sampled actions) and the relaxed control formulation, proving that the value functions converge as the sampling time grid is refined. They then derive an exploratory Hamilton-Jacobi-Bellman equation for the relaxed control problem, noting that the common noise introduces a nonlinear functional dependence on the policy, which complicates standard policy iteration. Under a concavity condition on the value function with respect to the policy, they establish existence and uniqueness of the optimal one-step policy improvement via a first-order condition involving partial linear functional derivatives.
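To fix ideas, here is a schematic of the entropy-regularized (exploratory) objective that this literature builds on, together with the shape of the first-order condition. The notation (running reward r, terminal reward g, temperature γ, gain functional J) is generic and should be read as an illustration, not the paper's exact statement; μ_s denotes the conditional law of the state given the common noise W⁰.

```latex
% Schematic exploratory objective (generic notation, not the paper's
% exact statement): actions are drawn from a stochastic policy \pi,
% and exploration is rewarded via an entropy term with temperature \gamma.
V(t,\mu) = \sup_{\pi}\, \mathbb{E}\Big[ \int_t^T \Big(
      \int_A r(X_s,\mu_s,a)\,\pi(a \mid X_s)\,da
      - \gamma \int_A \pi(a \mid X_s)\log\pi(a \mid X_s)\,da
   \Big)\,ds + g(X_T,\mu_T) \;\Big|\; \mu_t = \mu \Big],
\qquad \mu_s = \mathcal{L}\big(X_s \mid \mathcal{F}^{W^0}_s\big).

% Under the concavity condition, the one-step improvement \pi^* is the
% unique zero of the first-order condition in the policy variable,
% taken in the sense of partial linear functional derivatives:
\frac{\delta J}{\delta \pi}\big(t,\mu,\pi^*\big)(x,a) = 0
\quad \text{for all } (x,a).
```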
The core innovation is the introduction of an integrated q-function (Iq-function), defined on the state distribution (the mean field) and the policy. Unlike the standard Q-function, the Iq-function captures the aggregate effect of common noise across all agents, allowing the optimal policy to be characterized as a two-layer fixed point: first, the policy must maximize the Iq-function given the current distribution; second, the distribution must be a fixed point of the resulting dynamics. This generalizes previous single-agent results to the mean-field setting. As a concrete example, the authors derive explicit closed-form Gaussian optimal policies for the linear-quadratic (LQ) case, demonstrating tractability. This work provides a rigorous theoretical foundation for deploying continuous-time reinforcement learning in large-scale multi-agent systems with shared randomness.
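One schematic way to write the two-layer characterization, again with illustrative notation rather than the paper's exact definitions: a pointwise q-function q(t, x, a, μ) is aggregated over the population to form the Iq-function, the optimal policy maximizes the entropy-regularized Iq-function at the frozen distribution, and the distribution must in turn be the conditional law generated by that policy.

```latex
% Illustrative aggregation defining the integrated q-function:
Iq(t,\mu,\pi) := \int_{\mathbb{R}^d}\!\int_A q(t,x,a,\mu)\,
    \pi(a \mid x,\mu)\,da\,\mu(dx).

% Two-layer fixed point: (inner) the policy maximizes the
% entropy-regularized Iq-function at the frozen distribution \mu^*;
% (outer) \mu^* is the conditional law of the state process that
% this policy generates, given the common noise W^0.
\pi^* \in \operatorname*{arg\,max}_{\pi}\;
    Iq(t,\mu^*,\pi) - \gamma \int_{\mathbb{R}^d}\!\int_A
    \pi\log\pi\,da\,\mu^*(dx),
\qquad
\mu^*_t = \mathcal{L}\big(X^{\pi^*}_t \mid \mathcal{F}^{W^0}_t\big).
```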
- Value functions for discretely sampled actions converge to the relaxed-control values as the time grid is refined.
- Under a concavity condition, a unique optimal one-step policy improvement exists, characterized by a first-order condition involving partial linear functional derivatives.
- An integrated q-function (Iq-function) on the state distribution and the policy characterizes the optimal policy as a two-layer fixed point, with explicit Gaussian solutions in linear-quadratic settings (see the numerical sketch below).
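As a rough illustration of how the two-layer fixed point can be iterated numerically, here is a minimal, self-contained Python sketch for a toy one-dimensional LQ problem with common noise. Everything in it, including the dynamics, the cost coefficients, the Gaussian policy class with variance γ/(2R) (the form familiar from exploratory LQ control), and the grid-search improvement step, is an illustrative assumption, not the paper's algorithm.

```python
# Toy numerical sketch of the two-layer fixed point for a 1-D
# linear-quadratic mean-field control problem with common noise.
# All model parameters, the policy class, and the improvement step
# are illustrative assumptions; this is NOT the paper's algorithm.
import numpy as np

rng = np.random.default_rng(0)

T, dt = 1.0, 0.02                   # horizon and time step
n_steps = int(T / dt)
n_agents = 2000                     # particles approximating the mean field
sigma, sigma0 = 0.3, 0.2            # idiosyncratic / common noise volatilities
Q, R, Qbar = 1.0, 0.5, 0.5          # running cost: Q x^2 + R a^2 + Qbar (x - m)^2
gamma = 0.1                         # entropy temperature (exploration level)
pol_std = np.sqrt(gamma / (2 * R))  # Gaussian policy std (exploratory-LQ form)

def mean_field_flow(k1, k2):
    """Outer layer: propagate the empirical (conditional) mean m_t under
    the feedback policy a = k1*x + k2*m along one common-noise path."""
    x = rng.normal(0.0, 1.0, n_agents)
    flow = np.empty(n_steps)
    for t in range(n_steps):
        m = x.mean()
        flow[t] = m
        a = k1 * x + k2 * m
        dW = np.sqrt(dt) * rng.normal(size=n_agents)   # idiosyncratic noise
        dW0 = np.sqrt(dt) * rng.normal()               # common noise, shared
        x = x + a * dt + sigma * dW + sigma0 * dW0
    return flow

def policy_cost(k1, k2, flow):
    """Inner layer: Monte Carlo cost of a ~ N(k1*x + k2*m_t, pol_std^2)
    with the mean-field flow m_t FROZEN at `flow`.  The entropy term is
    constant for a fixed policy variance, so it is omitted here."""
    x = rng.normal(0.0, 1.0, n_agents)
    cost = 0.0
    for t in range(n_steps):
        m = flow[t]
        a = k1 * x + k2 * m + pol_std * rng.normal(size=n_agents)
        cost += dt * np.mean(Q * x**2 + R * a**2 + Qbar * (x - m)**2)
        dW = np.sqrt(dt) * rng.normal(size=n_agents)
        dW0 = np.sqrt(dt) * rng.normal()
        x = x + a * dt + sigma * dW + sigma0 * dW0
    return cost

# Two-layer fixed-point iteration: freeze the flow, improve the policy
# by brute-force grid search over the feedback gains, then re-propagate.
k1, k2 = 0.0, 0.0
grid = np.linspace(-2.0, 0.5, 26)
for it in range(5):
    flow = mean_field_flow(k1, k2)
    costs = np.array([[policy_cost(a1, a2, flow) for a2 in grid] for a1 in grid])
    i, j = np.unravel_index(costs.argmin(), costs.shape)
    k1, k2 = grid[i], grid[j]
    print(f"iter {it}: k1={k1:+.2f}  k2={k2:+.2f}  cost={costs[i, j]:.3f}")
```

The brute-force grid search merely stands in for the analytic policy improvement step; in the LQ setting, the paper's explicit closed-form Gaussian policy would replace it.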
Why It Matters
Provides a rigorous theoretical backbone for continuous-time multi-agent RL in systems with common noise, enabling better coordination in autonomous vehicles, energy grids, and finance.