Incorporating domain models into Bayesian optimization for RL

Authors:
Aaron Wilson; Alan Fern; Prasad Tadepalli
Affiliations:
Oregon State University, School of EECS (all authors)
Venue:
ECML PKDD'10: Proceedings of the 2010 European Conference on Machine Learning and Knowledge Discovery in Databases, Part III
Year:
2010

Abstract

In many Reinforcement Learning (RL) domains, generating the experience needed to evaluate an agent's performance is costly. An appealing approach to reducing the number of expensive evaluations is Bayesian Optimization (BO), a framework for global optimization of noisy, costly-to-evaluate functions. Prior work in a number of RL domains has demonstrated the effectiveness of BO for optimizing parametric policies. However, those approaches completely ignore the state-transition sequence of policy executions and consider only the total reward achieved. In this paper, we study how to more effectively incorporate all of the information observed during policy executions into the BO framework. In particular, our approach uses the observed data to learn approximate transition models that allow for Monte Carlo predictions of policy returns. The models are then incorporated into the BO framework as a type of prior on policy returns, which can better inform the BO process. The resulting algorithm provides a new approach for leveraging learned models in RL even when no planner is available for exploiting those models. We demonstrate the effectiveness of our algorithm in four benchmark domains whose dynamics vary in complexity. Results indicate that our algorithm effectively uses model-based predictions to improve the data efficiency of model-free BO methods, and that it is robust to modeling errors when parts of the domain cannot be modeled successfully.
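
To make the idea in the abstract concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not specify the domain, model class, kernel, or acquisition function, so everything here is an assumption chosen for illustration: a toy 1-D linear system, a least-squares transition model, a squared-exponential kernel, and an upper-confidence-bound acquisition rule. All function and variable names are hypothetical. The sketch shows only the core mechanism: Monte Carlo return estimates from a learned transition model serve as the prior mean of a Gaussian process over policy returns.

```python
import numpy as np

HORIZON = 20  # steps per policy execution (assumed toy setting)

def rollout(theta, a, b, noise_std, rng):
    """Run policy u = -theta*s in s' = a*s + b*u + noise; return (total reward, transitions)."""
    s, total, transitions = 1.0, 0.0, []
    for _ in range(HORIZON):
        u = -theta * s
        s_next = a * s + b * u + noise_std * rng.normal()
        transitions.append((s, u, s_next))
        total -= s_next ** 2  # reward penalizes distance from the origin
        s = s_next
    return total, transitions

def fit_model(transitions):
    """Least-squares estimate of (a, b) and residual noise from observed transitions."""
    data = np.array(transitions)
    X, y = data[:, :2], data[:, 2]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[0], coef[1], float(np.std(y - X @ coef))

def mc_return(thetas, model, n_rollouts=20, seed=1):
    """Monte Carlo return estimate for each theta, simulated in the learned model."""
    a, b, noise_std = model
    means = []
    for th in np.atleast_1d(thetas):
        sim_rng = np.random.default_rng(seed)  # same seed per theta keeps the prior mean deterministic
        sims = [rollout(th, a, b, noise_std, sim_rng)[0] for _ in range(n_rollouts)]
        means.append(np.mean(sims))
    return np.array(means)

def rbf(x1, x2, length_scale=0.5):
    """Squared-exponential kernel with unit signal variance (assumed)."""
    return np.exp(-0.5 * ((x1[:, None] - x2[None, :]) / length_scale) ** 2)

def gp_posterior(x_obs, y_obs, x_query, prior_mean, noise_var=1e-2):
    """GP posterior mean and std over returns, using a non-zero prior mean function."""
    K = rbf(x_obs, x_obs) + noise_var * np.eye(len(x_obs))
    K_s = rbf(x_query, x_obs)
    mean = prior_mean(x_query) + K_s @ np.linalg.solve(K, y_obs - prior_mean(x_obs))
    var = 1.0 - np.sum(K_s * np.linalg.solve(K, K_s.T).T, axis=1)
    return mean, np.sqrt(np.clip(var, 0.0, None))

# True dynamics, hidden from the optimizer (assumed values).
TRUE_A, TRUE_B, TRUE_NOISE = 0.9, 0.5, 0.05
rng = np.random.default_rng(0)
grid = np.linspace(0.0, 3.0, 41)  # candidate policy parameters

thetas, returns, data = [], [], []
for th in (0.0, 3.0):  # two initial policy executions
    r, tr = rollout(th, TRUE_A, TRUE_B, TRUE_NOISE, rng)
    thetas.append(th); returns.append(r); data.extend(tr)

for _ in range(8):
    model = fit_model(data)  # refit on all transitions seen so far
    prior = lambda t, m=model: mc_return(t, m)
    mean, std = gp_posterior(np.array(thetas), np.array(returns), grid, prior)
    th = float(grid[np.argmax(mean + 2.0 * std)])  # UCB acquisition (assumed choice)
    r, tr = rollout(th, TRUE_A, TRUE_B, TRUE_NOISE, rng)
    thetas.append(th); returns.append(r); data.extend(tr)

best_theta = thetas[int(np.argmax(returns))]
print(f"best theta: {best_theta:.3f}, best return: {max(returns):.3f}")
```

In this toy setting the GP only has to model the residual between the observed returns and the model-based predictions, which is how a reasonably accurate transition model can reduce the number of real policy executions. When the learned model is poor, the residual is larger and the GP correction carries more of the load.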