Optimal tuning of continual online exploration in reinforcement learning

Authors:
Youssef Achbany;Francois Fouss;Luh Yen;Alain Pirotte;Marco Saerens
Affiliations:
Information Systems Research Unit (ISYS), Place des Doyens 1, Université de Louvain, Belgium (all authors)
Venue:
ICANN'06 Proceedings of the 16th international conference on Artificial Neural Networks - Volume Part I
Year:
2006

Abstract

This paper presents a framework for tuning continual exploration optimally. It first quantifies exploration by defining the degree of exploration of a state as the entropy of the probability distribution over the admissible actions in that state. The exploration/exploitation tradeoff is then stated as a global optimization problem: find the exploration strategy that minimizes the expected cumulative cost while maintaining fixed degrees of exploration at the nodes. In other words, “exploitation” is maximized for constant “exploration”. This formulation leads to a set of nonlinear updating rules reminiscent of the value-iteration algorithm. Convergence of these rules to a local minimum can be proved for a stationary environment. Interestingly, in the deterministic case, when there is no exploration, these equations reduce to the Bellman equations for finding the shortest path, whereas when exploration is maximal, a full “blind” exploration is performed.
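
The abstract does not state the updating rules, so the following is only an illustrative sketch under stated assumptions, not the authors' algorithm. It assumes a Boltzmann (softmax) action distribution controlled by an inverse temperature `theta`, and it fixes `theta` directly rather than fixing the entropy at each state as the abstract describes. The graph, its costs, and all function names are hypothetical. The sketch shows how an entropy-based degree of exploration can be computed and how a value-iteration-like update behaves at the two extremes mentioned above.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def boltzmann(q_values, theta):
    """Assumed policy form: probabilities proportional to exp(-theta * cost).

    Subtracting the minimum cost keeps the exponentials numerically stable.
    """
    q_min = min(q_values)
    weights = [math.exp(-theta * (q - q_min)) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def soft_value_iteration(costs, goal, theta, sweeps=100):
    """Value-iteration-like sweeps on a deterministic graph (a sketch).

    costs[state][successor] is the cost of moving from state to successor.
    The value of a state is the expected cost-to-go under the Boltzmann policy.
    """
    values = {state: 0.0 for state in costs}
    policy = {}
    for _ in range(sweeps):
        for state, edges in costs.items():
            if state == goal:
                continue  # the goal is absorbing with zero cost-to-go
            successors = list(edges)
            q = [edges[nxt] + values[nxt] for nxt in successors]
            probs = boltzmann(q, theta)
            policy[state] = dict(zip(successors, probs))
            values[state] = sum(p * qi for p, qi in zip(probs, q))
    return values, policy


# Hypothetical four-node graph; "G" is the goal node.
COSTS = {
    "A": {"B": 1.0, "C": 4.0},
    "B": {"C": 1.0, "G": 5.0},
    "C": {"G": 1.0},
    "G": {},
}

greedy_values, greedy_policy = soft_value_iteration(COSTS, "G", theta=50.0)
blind_values, blind_policy = soft_value_iteration(COSTS, "G", theta=0.0)
print(greedy_values["A"], entropy(greedy_policy["A"].values()))
print(blind_values["A"], entropy(blind_policy["A"].values()))
```

With these hypothetical costs, a large `theta` drives the entropy at each node toward zero and the value at `A` approaches the shortest-path cost of 3.0, while `theta = 0` yields a uniform ("blind") choice among successors, with entropy ln 2 at `A` and an expected cost of 4.75.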