An autonomous explore/exploit strategy

  • Authors:
  • Alex McMahon; Dan Scott; Will Browne

  • Affiliations:
  • University of Reading, Berkshire, UK (all authors)

  • Venue:
  • GECCO '05 Proceedings of the 7th annual workshop on Genetic and evolutionary computation
  • Year:
  • 2005

Abstract

In reinforcement learning problems it has long been recognised that neither exploitation nor exploration can be pursued exclusively without failing at the task. The optimal balance between exploring and exploiting shifts as training progresses, because the amount of learnt knowledge grows. This shift in balance is not known a priori, so an autonomous online adjustment is sought. Human beings manage this balance through reasoning and through exploration guided by feedback from the environment. The XCS learning classifier system uses a fixed explore/exploit balance, but it does maintain multiple statistics about its performance and its interaction with the environment. By utilising these statistics in a non-linear manner, autonomous adjustment of the explore/exploit balance was achieved. This reduced exploration in simple environments while allowing it to increase with the complexity of the problem domain. It also prevented unsuccessful 'loop' exploit trials and suggests a method of dynamic choice in goal setting.
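
The idea of driving the explore/exploit balance from the learner's own statistics can be sketched as follows. This is a hypothetical illustration, not the paper's actual XCS mechanism: the function names (`adaptive_explore_prob`, `choose_mode`), the use of mean prediction error as the driving statistic, and the particular saturating non-linear mapping are all assumptions made for the example.

```python
import math
import random

def adaptive_explore_prob(avg_error, min_p=0.05, max_p=0.5, k=5.0):
    """Map a running performance statistic (here: mean prediction error)
    to an exploration probability via a saturating non-linear curve.

    Hypothetical sketch: low error (the system predicts well) drives the
    probability toward min_p, so exploration shrinks in simple, well-learnt
    environments; high error drives it toward max_p, so exploration grows
    with problem complexity. The exponential shape and the constants
    min_p, max_p, k are illustrative assumptions.
    """
    p = min_p + (max_p - min_p) * (1.0 - math.exp(-k * avg_error))
    return min(max_p, max(min_p, p))

def choose_mode(avg_error, rng=random.random):
    """Decide per trial whether to run an explore or an exploit episode,
    using the adaptive probability instead of a fixed 50/50 split."""
    return "explore" if rng() < adaptive_explore_prob(avg_error) else "exploit"
```

A fixed XCS schedule alternates explore and exploit trials with constant probability; the sketch above simply replaces that constant with a statistic-dependent value, so no other part of the trial loop needs to change.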