Research & Papers

Regularized Centered Emphatic Temporal Difference Learning

A single regularization parameter c lifts the lower-right block of the key matrix from 1 to 1+c, preventing divergence in centered emphatic TD.

Deep Dive

Off-policy temporal-difference (TD) learning with function approximation faces a fundamental tradeoff among stability, projection geometry, and variance control. Emphatic TD (ETD) improves off-policy projection geometry through follow-on emphasis, but the follow-on trace can introduce high variance. Bellman-error centering naturally removes a common drift term from TD errors, but the researchers show that a naive centered emphatic extension introduces an auxiliary coupling that can destroy the positive-definiteness of the ETD key matrix, leading to instability.
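
For orientation, the ingredients named above can be written out. The following is a minimal sketch using common conventions from the emphatic-TD literature, with interest i(S_t), importance ratio \rho_t = \pi(A_t \mid S_t)/b(A_t \mid S_t), features \phi_t, and a running drift estimate \bar{r}_t; the paper's exact definitions may differ:

    F_t = \gamma \rho_{t-1} F_{t-1} + i(S_t)                                        % follow-on trace
    \delta_t = R_{t+1} - \bar{r}_t + \gamma w_t^\top \phi_{t+1} - w_t^\top \phi_t   % centered TD error
    w_{t+1} = w_t + \alpha F_t \rho_t \delta_t \phi_t                               % emphatic update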

To solve this, the team proposes Regularized Centered Emphatic Temporal Difference Learning (RETD). RETD preserves the follow-on trace and regularizes only the auxiliary centering recursion, effectively lifting the lower-right block of the coupled key matrix from 1 to 1+c. They derive the RETD core matrix, prove convergence under a conservative sufficient regularization condition, and evaluate the method on diagnostic linear off-policy prediction tasks. The experiments demonstrate that RETD avoids the instability of naive centered emphatic learning, preserves favorable emphatic geometry, and exhibits a robust intermediate regime for the regularization parameter c across the diagnostics.
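
The following is a minimal Python sketch of one RETD-style step, assuming ETD(0)-style updates with interest fixed at 1, a scalar centering estimate r_bar, and this particular placement of the -c * r_bar regularizer; the variable names, step sizes, and update forms are illustrative readings of the summary above, not the paper's equations.

    import numpy as np

    def retd_step(w, r_bar, F, phi, phi_next, reward, rho, rho_prev,
                  gamma=0.99, alpha=0.01, eta=0.1, c=0.5):
        # Follow-on trace, with interest taken as 1 for every state.
        F = gamma * rho_prev * F + 1.0
        # Centered TD error: subtracting r_bar removes the drift term.
        delta = reward - r_bar + gamma * np.dot(w, phi_next) - np.dot(w, phi)
        # Emphatically weighted, importance-corrected parameter update.
        w = w + alpha * F * rho * delta * phi
        # Regularized auxiliary centering recursion: the extra -c * r_bar
        # term is what lifts the lower-right block from 1 to 1 + c.
        r_bar = r_bar + eta * alpha * (F * rho * delta - c * r_bar)
        return w, r_bar, F

Carrying (w, r_bar, F) across transitions reproduces a linear off-policy prediction run; rho_prev is the importance ratio from the previous step (1 at the start of an episode).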

Key Points
  • Naive centered emphatic TD destroys the positive-definiteness of the key matrix through an auxiliary coupling.
  • RETD adds a single regularization parameter c, lifting the lower-right block of the key matrix from 1 to 1+c to ensure stability (see the matrix sketch after this list).
  • Experiments show RETD avoids instability across diagnostic linear off-policy tasks, with a robust intermediate regime for c.
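
To make the second bullet concrete, here is a sketch of the block structure implied by the summary; the blocks A, b, d stand for the expected-update terms, whose exact forms are an assumption here. Stacking the value parameters with the auxiliary centering variable, the naive centered emphatic iteration is driven by a coupled key matrix whose lower-right entry RETD lifts:

    \bar{A} = \begin{pmatrix} A & b \\ d^\top & 1 \end{pmatrix}
    \quad\longrightarrow\quad
    \bar{A}_c = \begin{pmatrix} A & b \\ d^\top & 1 + c \end{pmatrix}

The off-diagonal coupling b, d is what can break positive-definiteness of \bar{A}; lifting the lower-right entry to 1 + c restores it for c large enough to satisfy the paper's sufficient condition.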

Why It Matters

Improves the stability and reliability of off-policy reinforcement learning algorithms, enabling safer deployment in real-world applications.