New STHTD-MP algorithm accelerates off-policy RL prediction via behavior-induced metric

Replaces the feature-covariance metric with the symmetric part of the behavior-policy Bellman matrix, aiming for faster convergence in temporal-difference learning.

Deep Dive

In a new preprint (arXiv:2605.28849), Xingguo Chen and colleagues introduce STHTD-MP (Behavior-Induced Mirror-Prox Temporal-Difference Learning), a method designed to speed up off-policy prediction in reinforcement learning. Off-policy prediction means evaluating a target policy from data generated by a different behavior policy. Gradient temporal-difference (GTD) methods handle this setting but often converge slowly, which the authors attribute to a poorly chosen metric. The key innovation: instead of using the feature covariance matrix as the geometry for the Mirror-Prox primal-dual saddle-point formulation, STHTD-MP uses the symmetric part of the behavior-policy Bellman matrix. Because this behavior-induced metric reflects the transition dynamics between states, and not only the feature statistics, it is intended to give a more informative update direction.
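To make the two metrics concrete, here is a minimal NumPy sketch on a toy problem. It assumes the common linear-TD definitions, which the summary above does not spell out: the covariance metric is C = Φᵀ D Φ and the behavior-policy Bellman matrix is A_b = Φᵀ D (I − γ P_b) Φ, with D the diagonal matrix of the behavior policy's stationary distribution. The 3-state chain, the features, and all function names are hypothetical and chosen only for illustration; the paper's exact construction may differ.

```python
import numpy as np

def stationary_distribution(P):
    """Stationary distribution of an ergodic row-stochastic matrix P."""
    evals, evecs = np.linalg.eig(P.T)
    # Left eigenvector of P for the eigenvalue closest to 1, normalised to sum to 1.
    v = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    return v / v.sum()

def covariance_metric(Phi, d):
    """Feature covariance C = Phi^T D Phi (the GTD2-style metric)."""
    return Phi.T @ (d[:, None] * Phi)

def behavior_induced_metric(Phi, P_b, d, gamma):
    """Symmetric part of A_b = Phi^T D (I - gamma P_b) Phi (assumed definition)."""
    n = P_b.shape[0]
    A_b = Phi.T @ (d[:, None] * ((np.eye(n) - gamma * P_b) @ Phi))
    return 0.5 * (A_b + A_b.T)

# Hypothetical 3-state behavior chain and 2-dimensional features.
P_b = np.array([[0.50, 0.50, 0.00],
                [0.25, 0.50, 0.25],
                [0.00, 0.50, 0.50]])
Phi = np.array([[1.0, 0.0],
                [0.5, 0.5],
                [0.0, 1.0]])
gamma = 0.9

d = stationary_distribution(P_b)
C = covariance_metric(Phi, d)
M_b = behavior_induced_metric(Phi, P_b, d, gamma)

print("covariance metric eigenvalues:      ", np.linalg.eigvalsh(C))
print("behavior-induced metric eigenvalues:", np.linalg.eigvalsh(M_b))
```

Under these assumptions, C depends only on the features and the state weighting, while M_b also depends on P_b, which is the sense in which the metric is "behavior-induced". Printing the eigenvalues is a quick way to check that the metric is positive definite on a given problem.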

The authors prove convergence under standard stochastic approximation assumptions: positivity of the induced metric, Hurwitz stability of the joint mean system, and boundedness via Lyapunov arguments. They derive ergodic gap bounds and a mean-operator comparison with GTD2-MP, showing that STHTD-MP can achieve a smaller mean contraction factor when the behavior-induced metric improves the saddle-point geometry. Empirical evaluations on two-state, Random Walk, and Boyan Chain benchmarks support the theoretical advantage. Notably, Baird’s counterexample is identified as a singular case in which the strict assumptions fail. The paper also notes that STHTD-MP keeps a single learning rate for both primal and auxiliary variables, which simplifies tuning. The work connects Mirror-Prox and hybrid TD methods, offering a practical path to faster, more stable off-policy learning.
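The mean-operator comparison can be illustrated with a generic extragradient (Mirror-Prox style) iteration on a linear saddle-point problem. The sketch below is not the paper's algorithm. It assumes the saddle function L(θ, y) = ⟨b − Aθ, y⟩ − ½ yᵀ M y, with A = Φᵀ D (I − γ P_π) Φ for a hypothetical target-policy transition matrix P_π, and compares the spectral radius of the mean update when M is the covariance metric versus the symmetric part of the behavior-policy Bellman matrix. The toy chain, step size, and function names are all assumptions.

```python
import numpy as np

def mean_saddle_jacobian(A, M):
    """Jacobian J of the mean field for L(theta, y) = <b - A theta, y> - 0.5 y^T M y.

    theta moves along +A^T y (descent in theta); y moves along b - A theta - M y (ascent in y).
    """
    k = A.shape[0]
    return np.block([[np.zeros((k, k)), A.T],
                     [-A, -M]])

def extragradient_matrix(J, alpha):
    """Error-propagation matrix of z_mid = z + a*F(z), z_next = z + a*F(z_mid), F(z) = J z + c."""
    I = np.eye(J.shape[0])
    return I + alpha * J @ (I + alpha * J)

def spectral_radius(T):
    return float(np.max(np.abs(np.linalg.eigvals(T))))

# Hypothetical toy problem: 3 states, 2 features.
Phi = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
P_b = np.array([[0.50, 0.50, 0.00], [0.25, 0.50, 0.25], [0.00, 0.50, 0.50]])   # behavior
P_pi = np.array([[0.20, 0.80, 0.00], [0.10, 0.50, 0.40], [0.00, 0.20, 0.80]])  # target
d = np.array([0.25, 0.50, 0.25])  # stationary distribution of P_b
gamma, alpha = 0.9, 0.5
n = len(d)

D_Phi = d[:, None] * Phi
A = D_Phi.T @ ((np.eye(n) - gamma * P_pi) @ Phi)    # off-policy TD matrix
C = Phi.T @ D_Phi                                   # covariance metric
A_b = D_Phi.T @ ((np.eye(n) - gamma * P_b) @ Phi)   # behavior-policy Bellman matrix
M_b = 0.5 * (A_b + A_b.T)                           # behavior-induced metric

J_cov = mean_saddle_jacobian(A, C)
J_beh = mean_saddle_jacobian(A, M_b)
rho_cov = spectral_radius(extragradient_matrix(J_cov, alpha))
rho_beh = spectral_radius(extragradient_matrix(J_beh, alpha))

print(f"mean contraction factor, covariance metric:       {rho_cov:.4f}")
print(f"mean contraction factor, behavior-induced metric: {rho_beh:.4f}")
```

A spectral radius below 1 means the mean iteration contracts toward the saddle point, and a smaller value means faster mean convergence. Which metric wins depends on the problem and the step size, consistent with the conditional form of the claim ("can achieve ... when the behavior-induced metric improves the saddle-point geometry").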

Key Points
  • Introduces a behavior-induced metric based on the symmetric part of the behavior-policy Bellman matrix, replacing the standard feature-covariance metric.
  • Proves convergence via the ODE method, Lyapunov boundedness, and ergodic gap bounds; shows that the mean contraction factor can be smaller than that of GTD2-MP.
  • Empirical validation on three benchmarks (two-state, Random Walk, Boyan Chain) with Baird’s counterexample as a singular boundary case.

Why It Matters

Faster off-policy prediction means more sample-efficient RL agents, critical for real-world applications like robotics and autonomous driving.