BA-TDRC stabilizes off-policy RL with behavior-aware auxiliary corrections
New approach replaces covariance matrix with behavior Bellman matrix for robust learning
Reinforcement learning often relies on temporal-difference (TD) learning, but off-policy sampling, where the data comes from a behavior policy other than the one being evaluated, can make TD updates unstable. The classic TDC algorithm adds an auxiliary covariance-based correction, while TDRC regularizes that correction. In a new arXiv paper, Xingguo Chen and colleagues introduce BA-TDC and BA-TDRC, which replace the covariance matrix with the behavior Bellman matrix (A_μ). According to the authors, this behavior-aware geometry better captures the transition dynamics of the policy that generated the data.
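To make the two matrices concrete, here is a minimal NumPy sketch on a toy 3-state chain. It assumes the standard linear-TD forms C = Φᵀ D_μ Φ for the feature covariance and A_μ = Φᵀ D_μ (I − γ P_μ) Φ for the behavior Bellman matrix; the article does not spell out the paper's exact definition, so that form, the chain, the features, and the helper names are all illustrative assumptions.

```python
import numpy as np

# Toy 3-state chain under the behavior policy mu (hypothetical numbers).
P_mu = np.array([[0.50, 0.50, 0.00],
                 [0.25, 0.50, 0.25],
                 [0.00, 0.50, 0.50]])
d_mu = np.array([0.25, 0.50, 0.25])   # stationary distribution of P_mu
Phi = np.array([[1.0, 0.0],           # one feature row per state
                [1.0, 1.0],
                [0.0, 1.0]])


def covariance_matrix(Phi, d):
    """C = Phi^T D Phi: feature covariance under the state distribution d."""
    return Phi.T @ np.diag(d) @ Phi


def behavior_bellman_matrix(Phi, d, P, gamma):
    """Assumed form A_mu = Phi^T D (I - gamma * P) Phi.

    This is the usual TD 'key matrix' evaluated under the behavior policy;
    it is an assumption here, not a definition quoted from the paper.
    """
    n_states = P.shape[0]
    return Phi.T @ np.diag(d) @ (np.eye(n_states) - gamma * P) @ Phi


C = covariance_matrix(Phi, d_mu)
A_mu = behavior_bellman_matrix(Phi, d_mu, P_mu, gamma=0.9)
print("C =\n", C)
print("A_mu =\n", A_mu)
```

Under this assumed form, setting γ = 0 collapses A_μ back to C, so the covariance matrix is the special case that ignores where the behavior policy goes next.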
The authors prove that their methods preserve fixed points and converge almost surely under a Hurwitz stability condition. On benchmarks like Baird's counterexample and Random Walk, BA-TDC alone provides significant gains, but regularization in BA-TDRC is essential for robust performance across diverse settings. The work offers a tractable model for understanding correction dynamics in neural-network value approximation, where feature covariances and temporal transitions jointly shape last-layer updates. This could lead to more stable deep RL algorithms.
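A matrix is Hurwitz when every eigenvalue has a strictly negative real part, which is what pulls the noise-free (mean) linear system toward its fixed point. The sketch below checks that property and iterates a mean system x ← x + step · (G x + g); the matrices and helper names are hypothetical examples, not the system matrix analyzed in the paper.

```python
import numpy as np


def is_hurwitz(M, tol=1e-12):
    """Return True if all eigenvalues of M have strictly negative real part."""
    eigvals = np.linalg.eigvals(np.asarray(M, dtype=float))
    return bool(np.all(eigvals.real < -tol))


def iterate_mean_system(G, g, x0, step, n_steps):
    """Run the noise-free recursion x <- x + step * (G @ x + g)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = x + step * (G @ x + g)
    return x


# Illustrative matrices only.
G_stable = np.array([[-1.0, 2.0],
                     [0.0, -3.0]])      # eigenvalues -1 and -3
G_rotation = np.array([[0.0, 1.0],
                       [-1.0, 0.0]])    # eigenvalues +/- i, not Hurwitz
g = np.array([1.0, 3.0])

x_final = iterate_mean_system(G_stable, g, x0=[0.0, 0.0], step=0.1, n_steps=500)
x_star = np.linalg.solve(G_stable, -g)  # fixed point solves G x + g = 0

print(is_hurwitz(G_stable), is_hurwitz(G_rotation))
print(x_final, x_star)
```

With a Hurwitz matrix and a small enough step size, the iterate settles at the solution of G x + g = 0; the rotation matrix fails the check because its eigenvalues sit on the imaginary axis.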
- BA-TDC replaces the auxiliary covariance matrix with the behavior Bellman matrix for off-policy TD learning
- Almost-sure convergence is proven under a Hurwitz stability condition on the mean system
- Regularization (BA-TDRC) is required for robust performance on harder benchmarks like Baird's counterexample
Why It Matters
More stable off-policy learning could speed up RL applications in robotics, gaming, and autonomous systems, where agents often have to learn from data collected by a different policy.