TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation.
Reinforcement learning with replacing eligibility traces. Machine Learning (special issue on reinforcement learning).
Natural gradient works efficiently in learning. Neural Computation.
Markov Decision Processes: Discrete Stochastic Dynamic Programming.
Neuro-Dynamic Programming.
Kernel-Based Reinforcement Learning. Machine Learning.
Open Theoretical Questions in Reinforcement Learning. EuroCOLT '99: Proceedings of the 4th European Conference on Computational Learning Theory.
SIAM Journal on Control and Optimization.
Tree-Based Batch Mode Reinforcement Learning. The Journal of Machine Learning Research.
Finite-Time Bounds for Fitted Value Iteration. The Journal of Machine Learning Research.
ECML'05: Proceedings of the 16th European Conference on Machine Learning.
Teaching a robot to perform tasks with voice commands. MICAI'10: Proceedings of the 9th Mexican International Conference on Advances in Artificial Intelligence, Part I.
Reinforcement learning with a bilinear Q function. EWRL'11: Proceedings of the 9th European Conference on Recent Advances in Reinforcement Learning.
Policy iteration based on a learned transition model. ECML PKDD'12: Proceedings of the 2012 European Conference on Machine Learning and Knowledge Discovery in Databases, Part II.
In this paper we address reinforcement learning problems with continuous state-action spaces. We propose a new algorithm, fitted natural actor-critic (FNAC), that extends the work in [1] to allow for general function approximation and data reuse. We combine the natural actor-critic architecture of [1] with a variant of fitted value iteration that uses importance sampling. The resulting method combines the appealing features of both approaches while overcoming their main weaknesses: the gradient-based actor readily overcomes the difficulties that regression methods face when optimizing policies over continuous action spaces; in turn, the regression-based critic makes efficient use of data and avoids the convergence problems that TD-based critics often exhibit. We establish the convergence of our algorithm and illustrate its application in a simple continuous-state, continuous-action problem.
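To make the architecture concrete, below is a minimal numpy sketch of an FNAC-style loop, not the paper's implementation: the 1-D toy task, the feature maps phi and f_v, and all hyperparameters are illustrative assumptions. It shows the three ingredients the abstract names: a regression-based critic (a few sweeps of fitted policy evaluation), importance weights pi_current/pi_behaviour that let older samples be reused, and a natural-gradient actor update obtained by regressing the advantage onto the compatible features psi = grad_theta log pi, whose fitted weights are the natural-gradient direction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D toy task (an assumption, not the paper's benchmark):
# drive the state towards 0 under a small action penalty.
def env_step(s, a):
    s_next = np.clip(s + a, -2.0, 2.0)
    return s_next, -s_next ** 2 - 0.1 * a ** 2

def phi(s):                        # features for the policy mean
    return np.array([s, 1.0])

def f_v(s):                        # features for the regression-based critic
    return np.array([s ** 2, s, 1.0])

def log_pi(s, a, theta, sigma):    # log-density of a ~ N(theta . phi(s), sigma^2)
    mean = theta @ phi(s)
    return -0.5 * ((a - mean) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)

gamma, sigma, alpha = 0.95, 0.3, 0.1
theta = np.zeros(2)                # actor parameters
buffer = []                        # transitions kept across iterations (data reuse)

for iteration in range(30):
    # 1) Collect transitions under the current policy, storing the behaviour
    #    log-density so samples can be reweighted under later policies.
    s = rng.uniform(-2.0, 2.0)
    for t in range(200):
        a = theta @ phi(s) + sigma * rng.standard_normal()
        s_next, r = env_step(s, a)
        buffer.append((s, a, r, s_next, log_pi(s, a, theta, sigma)))
        s = rng.uniform(-2.0, 2.0) if t % 20 == 19 else s_next
    buffer = buffer[-1000:]        # sliding window of recent experience

    S, A, R, S2, LOGB = (np.array(col) for col in zip(*buffer))
    # Importance weights pi_current / pi_behaviour enable reuse of old data.
    log_cur = np.array([log_pi(s, a, theta, sigma) for s, a in zip(S, A)])
    w_is = np.exp(np.clip(log_cur - LOGB, -5.0, 5.0))
    sw = np.sqrt(w_is)             # sqrt-weights for weighted least squares

    # 2) Regression-based critic: a few sweeps of fitted policy evaluation
    #    for V, then the advantage as the one-step TD residual.
    F = np.array([f_v(s) for s in S])
    F2 = np.array([f_v(s) for s in S2])
    v = np.zeros(3)
    for _ in range(20):
        targets = R + gamma * (F2 @ v)
        v, *_ = np.linalg.lstsq(F * sw[:, None], targets * sw, rcond=None)
    adv = R + gamma * (F2 @ v) - F @ v

    # 3) Natural-gradient actor: regress the advantage on the compatible
    #    features psi = grad_theta log pi = (a - theta.phi(s)) phi(s) / sigma^2;
    #    the fitted weights are then the natural gradient of expected return.
    PSI = np.array([(a - theta @ phi(s)) * phi(s) / sigma ** 2
                    for s, a in zip(S, A)])
    w_nat, *_ = np.linalg.lstsq(PSI * sw[:, None], adv * sw, rcond=None)
    theta += alpha * w_nat

print("learned policy parameters (mean = theta . [s, 1]):", theta)
```

The separation mirrors the abstract's argument: the actor never needs to maximize the critic over a continuous action set (it only follows the fitted natural-gradient weights), while the critic is a plain weighted regression and so reuses the whole buffer rather than discarding samples after one TD update.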