Improving optimistic exploration in model-free reinforcement learning

  • Authors:
  • Marek Grześ; Daniel Kudenko

  • Affiliations:
  • Department of Computer Science, University of York, Heslington, York, United Kingdom (both authors)

  • Venue:
  • ICANNGA'09: Proceedings of the 9th International Conference on Adaptive and Natural Computing Algorithms
  • Year:
  • 2009

Abstract

The key problem in reinforcement learning is the exploration-exploitation tradeoff. Optimistic initialisation of the value function is a popular RL exploration strategy. The problem with this approach is that the algorithm may still have relatively low performance after many episodes of learning. In this paper, two extensions to standard optimistic exploration are proposed. The first is based on a different initialisation of the value function of goal states. The second, which builds on the first, explicitly separates the propagation of low and high values in the state space. The proposed extensions show improvement in empirical comparisons with basic optimistic initialisation. Additionally, they improve anytime performance and help on domains where learning takes place in a subspace of a large state space, that is, where the standard optimistic approach faces more difficulties.
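To make the idea of optimistic initialisation concrete, the sketch below shows a minimal tabular Q-learning loop in which every state-action value starts at an optimistic constant, while goal states are initialised to a separate value, loosely in the spirit of the first extension described above. The environment interface (env.reset, env.step, env.n_actions), the constants Q_INIT and GOAL_INIT, and the purely greedy policy are illustrative assumptions, not the paper's actual implementation.

    from collections import defaultdict

    # Illustrative constants; the paper does not prescribe these values.
    Q_INIT = 10.0      # optimistic value for all unseen state-action pairs
    GOAL_INIT = 0.0    # goal states initialised separately (first extension)
    ALPHA, GAMMA, EPISODES = 0.1, 0.95, 500

    def optimistic_q_learning(env, goal_states):
        # Unseen (state, action) pairs default to the optimistic value,
        # which drives the agent to try actions it has not evaluated yet.
        Q = defaultdict(lambda: Q_INIT)
        for s in goal_states:
            for a in range(env.n_actions):
                Q[(s, a)] = GOAL_INIT
        for _ in range(EPISODES):
            s, done = env.reset(), False
            while not done:
                # Greedy selection; the optimism itself supplies exploration.
                a = max(range(env.n_actions), key=lambda x: Q[(s, x)])
                s_next, r, done = env.step(a)
                target = r if done else r + GAMMA * max(
                    Q[(s_next, a2)] for a2 in range(env.n_actions))
                Q[(s, a)] += ALPHA * (target - Q[(s, a)])
                s = s_next
        return Q

With plain optimistic initialisation, inflated values can persist in rarely visited regions for many episodes; the separate treatment of goal-state values and of the propagation of low versus high values in the paper is aimed at reducing this effect.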