Improving Gaussian process value function approximation in policy gradient algorithms

Authors:
Hunor Jakab;Lehel Csató
Affiliations:
Babeş-Bolyai University, Cluj-Napoca, Romania and Eötvös Loránd University, Budapest, Hungary;Babeş-Bolyai University, Cluj-Napoca, Romania and Eötvös Loránd University, Budapest, Hungary
Venue:
ICANN'11 Proceedings of the 21st international conference on Artificial neural networks - Volume Part II
Year:
2011

Citing 9
Cited 0

Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning.
Gradient descent for general reinforcement learning. Proceedings of the 1998 conference on Advances in neural information processing systems II.
Markov Decision Processes: Discrete Stochastic Dynamic Programming.
Reinforcement learning with Gaussian processes. ICML '05 Proceedings of the 22nd international conference on Machine learning.
Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning).
2008 Special Issue: Reinforcement learning of motor skills with policy gradients. Neural Networks.
Geodesic Gaussian kernels for value function approximation. Autonomous Robots.
Gaussian process dynamic programming. Neurocomputing.
Importance Sampling for Continuous Time Bayesian Networks. The Journal of Machine Learning Research.

Abstract

Value-function approximation in reinforcement learning (RL) is widely studied; its most common application is extending value-based RL methods to continuous domains. Gradient-based policy search algorithms can also benefit from an estimated value function, since the estimate can be used to reduce the variance of the gradient. In this article we present a new value-function approximation method based on a modified version of sparse on-line Gaussian process regression that uses the Kullback-Leibler (KL) distance. We combine it with Williams' episodic REINFORCE algorithm to reduce the variance of the gradient estimates. The need to completely re-estimate the value function after each gradient update step causes significant computational overhead. To overcome this problem, we propose a measure, composed of a KL-distance-based score and a time-dependent factor, for exchanging obsolete basis vectors with newly acquired measurements. This method leads to a more stable estimate of the action-value function and also reduces gradient variance. We provide performance and convergence comparisons for the described algorithm on a dynamic system control problem with a continuous state-action space.
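
The variance-reduction idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes a hypothetical one-step problem with a Gaussian policy and a made-up reward function, and it uses the analytically known expected reward as a stand-in for the Gaussian process value estimate. It only shows that subtracting a value estimate from the return leaves the REINFORCE gradient estimate unchanged on average while lowering its variance.

```python
import random
import statistics


def reinforce_gradient_samples(theta, sigma, baseline, n_samples, seed=0):
    """Per-sample REINFORCE gradient estimates for a Gaussian policy.

    The policy draws an action a ~ N(theta, sigma^2). The reward function
    below is a hypothetical toy choice, not taken from the paper.
    """
    rng = random.Random(seed)
    grads = []
    for _ in range(n_samples):
        a = rng.gauss(theta, sigma)
        reward = 20.0 - (a - 2.0) ** 2       # toy reward with a large offset
        score = (a - theta) / sigma ** 2     # d/dtheta of log N(a; theta, sigma^2)
        grads.append(score * (reward - baseline))
    return grads


theta, sigma = 0.0, 1.0

# Stand-in for a learned value function: the exact expected reward under
# the current policy, E[20 - (a - 2)^2] = 20 - (sigma^2 + (theta - 2)^2).
value_estimate = 20.0 - (sigma ** 2 + (theta - 2.0) ** 2)

# Same seed for both runs, so both estimators see the same sampled actions.
plain = reinforce_gradient_samples(theta, sigma, 0.0, 20000)
with_baseline = reinforce_gradient_samples(theta, sigma, value_estimate, 20000)

mean_plain = statistics.fmean(plain)
mean_baseline = statistics.fmean(with_baseline)
var_plain = statistics.variance(plain)
var_baseline = statistics.variance(with_baseline)

print("mean gradient (no baseline):  ", mean_plain)
print("mean gradient (with baseline):", mean_baseline)
print("variance (no baseline):       ", var_plain)
print("variance (with baseline):     ", var_baseline)
```

In this toy setting the true gradient is -2(theta - 2) = 4, so both sample means should lie close to 4 while the baseline version shows a clearly smaller variance. In the setting the abstract describes, the baseline would instead come from the approximated value function.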