In many Reinforcement Learning (RL) domains, generating the experience needed to evaluate an agent's performance is costly. An appealing approach to reducing the number of expensive evaluations is Bayesian Optimization (BO), a framework for global optimization of noisy, costly-to-evaluate functions. Prior work in a number of RL domains has demonstrated the effectiveness of BO for optimizing parametric policies. However, those approaches ignore the state-transition sequence of policy executions and consider only the total reward achieved. In this paper, we study how to more effectively incorporate all of the information observed during policy executions into the BO framework. In particular, our approach uses the observed data to learn approximate transition models that allow for Monte-Carlo predictions of policy returns. The models are then incorporated into the BO framework as a type of prior on policy returns, which can better inform the BO process. The resulting algorithm provides a new approach for leveraging learned models in RL even when there is no planner available for exploiting those models. We demonstrate the effectiveness of our algorithm in four benchmark domains, which have dynamics of varying complexity. Results indicate that our algorithm effectively uses model-based predictions to improve the data efficiency of model-free BO methods, and is robust to modeling errors when parts of the domain cannot be modeled successfully.
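To make the idea concrete, here is a minimal sketch of one way a learned transition model could inform BO: Monte-Carlo rollouts under the model give a return estimate for each policy parameter, and that estimate serves as the prior mean of a Gaussian process over returns. The abstract does not specify the paper's exact construction, so this is only an illustrative reading. The names `mc_return`, `gp_posterior`, `step_model`, and `reward_fn`, the scalar state, the linear policy, and the squared-exponential kernel are all assumptions made for illustration.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between the rows of a and b (unit signal variance)."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / length_scale**2)

def mc_return(theta, step_model, reward_fn, s0, horizon=20, n_rollouts=10, rng=None):
    """Monte-Carlo return estimate for the (assumed) linear policy a = theta * s,
    simulated under a learned transition model step_model(s, a, rng) -> next state."""
    rng = np.random.default_rng(0) if rng is None else rng
    totals = []
    for _ in range(n_rollouts):
        s, total = s0, 0.0
        for _ in range(horizon):
            a = theta * s
            s = step_model(s, a, rng)
            total += reward_fn(s)
        totals.append(total)
    return float(np.mean(totals))

def gp_posterior(X, y, Xq, prior_mean, noise_var=1e-2):
    """GP posterior mean and variance at query points Xq, using prior_mean(x)
    (for example a model-based return estimate) as the GP prior mean."""
    m_X = np.array([prior_mean(x) for x in X])
    m_q = np.array([prior_mean(x) for x in Xq])
    K = rbf_kernel(X, X) + noise_var * np.eye(len(X))
    Ks = rbf_kernel(Xq, X)
    # The GP only has to explain the residual between observed and predicted returns.
    alpha = np.linalg.solve(K, y - m_X)
    mean = m_q + Ks @ alpha
    v = np.linalg.solve(K, Ks.T)
    var = 1.0 - np.sum(Ks * v.T, axis=1)
    return mean, np.maximum(var, 0.0)
```

In use, `prior_mean` could be something like `lambda th: mc_return(th[0], step_model, reward_fn, s0=1.0)`, and the next policy to execute could be chosen by maximizing an acquisition function such as an upper confidence bound over `mean` and `var`. Where the learned model is accurate, the residuals are small and few real evaluations are needed. Where it is poor, the observed returns correct the prediction through the GP term.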