Reinforcement learning (RL) is a paradigm for learning sequential decision-making tasks. Typically, however, the user must hand-tune exploration parameters for each domain and algorithm they use. In this work, we present an algorithm, called LEO, for learning these exploration strategies on-line. LEO uses bandit-type algorithms to adaptively select exploration strategies based on the rewards received when following them. We show empirically that this method performs well across a set of five domains, whereas for any given algorithm no single set of parameters is best across all domains. Our results demonstrate that LEO successfully learns the best exploration strategies on-line, increasing the reward received over static parameterizations of exploration and reducing the need to hand-tune exploration parameters.