Sharp asymptotic theory for Q-learning with LD2Z learning rate and its generalization
New theory explains why a simple learning rate schedule outperforms standard methods in reinforcement learning.
A team of researchers has published a significant theoretical advance for Q-learning, a foundational reinforcement learning algorithm. Their paper, "Sharp asymptotic theory for Q-learning with LD2Z learning rate and its generalization," provides the first rigorous mathematical analysis of the Linear Decay to Zero (LD2Z) learning-rate schedule, in which the step size at iteration t of an n-step run is η_{t,n} = η(1 − t/n). This schedule has been noted for strong empirical performance but lacked theoretical grounding. The authors generalize it to a Power-law Decay to Zero (PD2Z-ν) class and deliver non-asymptotic error bounds, a central limit theorem for a new tail-averaged estimator, and a strong invariance principle for the Q-learning iterate process.
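To make the schedule concrete, here is a minimal Python sketch of tabular Q-learning driven by a decay-to-zero step size. The function `pd2z_lr`, the power-law form η(1 − t/n)^ν assumed for PD2Z-ν, and the toy MDP are illustrative, not the authors' code or experiments; setting ν = 1 recovers the LD2Z schedule analyzed in the paper.

```python
import numpy as np

def pd2z_lr(t, n, eta=0.5, nu=1.0):
    """Decay-to-zero step size eta * (1 - t/n)**nu (assumed PD2Z-nu form).

    nu = 1.0 recovers the LD2Z schedule eta_{t,n} = eta * (1 - t/n).
    """
    return eta * (1.0 - t / n) ** nu

# Toy MDP, purely for illustration.
rng = np.random.default_rng(0)
S, A, gamma, n = 5, 2, 0.9, 10_000
P = rng.dirichlet(np.ones(S), size=(S, A))  # P[s, a] = distribution over next states
R = rng.uniform(size=(S, A))                # deterministic reward table
Q = np.zeros((S, A))

s = 0
for t in range(n):
    a = rng.integers(A)                      # uniform exploration policy
    s_next = rng.choice(S, p=P[s, a])
    td_target = R[s, a] + gamma * Q[s_next].max()
    Q[s, a] += pd2z_lr(t, n) * (td_target - Q[s, a])  # step size shrinks to zero by t = n
    s = s_next
```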
These contributions prove that LD2Z/PD2Z schedules achieve a 'best-of-both-worlds' property: they inherit the fast initial error reduction typical of constant learning rates while retaining the asymptotic convergence and bias elimination guaranteed by polynomially decaying schedules, resolving a core trade-off in RL optimization theory. Furthermore, the statistical tools established along the way, such as the Gaussian approximation, enable practical bootstrap-based inference for the learned Q-values, supporting confidence intervals and more robust policy evaluation. The work, accepted at ICLR 2026, bridges a critical gap between practice and theory in reinforcement learning.
- Proves LD2Z learning schedule combines fast initial convergence with asymptotic optimality, solving a key RL trade-off.
- Introduces a new 'tail' Polyak-Ruppert averaging estimator and proves a central limit theorem for it, enabling statistical inference.
- Establishes a strong invariance principle, enabling bootstrap methods to construct confidence intervals for Q-learning outcomes (see the sketch after this list).
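As a rough illustration of the last two points, the sketch below computes a tail Polyak-Ruppert average and a bootstrap confidence interval from a sequence of recorded Q-tables. The tail fraction, the moving-block resampling scheme, and all function names are assumptions for illustration; the paper's strong invariance principle is what justifies bootstrap inference here, but the authors' exact procedure may differ.

```python
import numpy as np

def tail_average(iterates, tail_frac=0.5):
    """Tail Polyak-Ruppert average: mean of the last tail_frac of iterates.

    tail_frac = 0.5 is a hypothetical default, not the paper's choice.
    """
    iterates = np.asarray(iterates)
    return iterates[int(len(iterates) * (1.0 - tail_frac)):].mean(axis=0)

def block_bootstrap_ci(iterates, tail_frac=0.5, block=50, B=1000,
                       alpha=0.05, seed=0):
    """Moving-block bootstrap CI for the tail-averaged estimate.

    A generic stand-in for invariance-principle-justified bootstrap
    inference; returns (point estimate, lower bound, upper bound).
    """
    rng = np.random.default_rng(seed)
    iterates = np.asarray(iterates)
    tail = iterates[int(len(iterates) * (1.0 - tail_frac)):]
    n_blocks = max(1, len(tail) // block)
    boots = []
    for _ in range(B):
        starts = rng.integers(0, len(tail) - block + 1, size=n_blocks)
        resample = np.concatenate([tail[s:s + block] for s in starts])
        boots.append(resample.mean(axis=0))
    lo, hi = np.quantile(boots, [alpha / 2, 1 - alpha / 2], axis=0)
    return tail.mean(axis=0), lo, hi
```

Feeding in snapshots recorded during training (for example, `Q.copy()` after each update in the earlier loop) yields a point estimate and an interval for every state-action value.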
Why It Matters
Provides a mathematically sound, high-performance default for RL training and enables statistical confidence in AI decision-making policies.