Gradient Temporal Difference-Difference Q-Learning for Control
Matthew Trappett
Committee: Allen Malony (chair), James Murray (Dept. of Bio/Math), Thanh Nguyen, Humphrey Shi
Directed Research Project (Sep 2022)
Keywords: Gradient Descent, Reinforcement Learning, GTD

Reinforcement learning (RL) has proven to be an effective method for solving difficult problems formulated as sequential decision-making tasks. While modern RL algorithms are robust, they are not guaranteed to converge to a solution when combining function approximation, Temporal Difference (TD) updates, and off-policy learning. Gradient Temporal Difference (GTD) algorithms have been developed and proven to be mathematically convergent; in practice, however, they are slow to learn. To improve performance while maintaining convergence guarantees, we use a second-order optimization constraint to implement a new algorithm, Gradient Temporal Difference-Difference Q-learning. We evaluate its performance against other GTD algorithms in the linear function approximation and actor-critic regimes on classic control environments. Our results show that our algorithm improves performance while maintaining mathematical stability.
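To make the idea concrete, the following is a minimal sketch (not the author's exact algorithm) of a GTD/TDC-style Q-learning update with linear function approximation, augmented by a difference penalty that pulls the weights toward their values from the previous iteration. The step sizes alpha and beta, the penalty strength lam, and the exact form of the difference term are illustrative assumptions.

```python
import numpy as np

def gtd_dd_q_update(w, h, w_prev, phi, phi_next, reward,
                    gamma=0.99, alpha=0.01, beta=0.05, lam=0.1):
    """One illustrative GTD-style Q-learning update with a difference penalty.

    w        -- primary weight vector (Q-value parameters)
    h        -- secondary weight vector used by the gradient correction
    w_prev   -- weights from the previous iteration (difference constraint)
    phi      -- feature vector of the current state-action pair
    phi_next -- feature vector of the greedy next state-action pair
    """
    # TD error for the current transition
    delta = reward + gamma * (phi_next @ w) - (phi @ w)
    # TDC-style primary update with a gradient-correction term
    w_new = w + alpha * (delta * phi - gamma * phi_next * (phi @ h))
    # Assumed second-order constraint: penalize drift from the previous weights
    w_new -= alpha * lam * (w - w_prev)
    # Secondary weights track the expected TD error under the feature covariance
    h_new = h + beta * (delta - phi @ h) * phi
    return w_new, h_new
```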