Introduction to Reinforcement Learning
Introduction to Reinforcement Learning
Control of exploitation-exploration meta-parameter in reinforcement learning
Neural Networks - Computational models of neuromodulation
Off-Policy Temporal Difference Learning with Function Approximation
ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning
Scalable Internal-State Policy-Gradient Methods for POMDPs
ICML '02 Proceedings of the Nineteenth International Conference on Machine Learning
ICANN/ICONIP'03 Proceedings of the 2003 joint international conference on Artificial neural networks and neural information processing
Feature extraction for decision-theoretic planning in partially observable environments
ICANN'06 Proceedings of the 16th international conference on Artificial Neural Networks - Volume Part I
Hi-index | 0.00 |
There has been a problem called "exploration-exploitation problem" in the field of reinforcement learning. An agent must decide whether to explore a better action which may not necessarily exist, or to exploit many rewards by taking the current best action. In this article, we propose an off-policy reinforcement learning method based on a natural policy gradient learning, as a solution of the exploration-exploitation problem. In our method, the policy gradient is estimated based on a sequence of state-action pairs sampled by performing an arbitrary "behavior policy"; this allows us to deal with the exploration-exploitation problem by handling the generation process of behavior policies. By applying to an autonomous control problem of a three-dimensional cartpole, we show that our method can realize an optimal control efficiently in a partially observable domain.