Tuning continual exploration in reinforcement learning: An optimality property of the Boltzmann strategy

  • Authors:
  • Youssef Achbany; François Fouss; Luh Yen; Alain Pirotte; Marco Saerens

  • Affiliations:
  • Information Systems Research Unit (ISYS), Louvain School of Management, Université de Louvain, Place des Doyens 1, Louvain-la-Neuve, B-1348, Belgium (Y. Achbany, L. Yen, A. Pirotte, M. Saerens); Facultés Universitaires Catholiques de Mons (FUCaM), Chaussée de Binche 151, Mons, B-7000, Belgium (F. Fouss)

  • Venue:
  • Neurocomputing
  • Year:
  • 2008

Abstract

This paper presents a model that allows tuning continual exploration in an optimal way by integrating exploration and exploitation in a common framework. It first quantifies exploration by defining the degree of exploration of a state as the entropy of the probability distribution for choosing an admissible action in that state. The exploration/exploitation tradeoff is then formulated as a global optimization problem: find the exploration strategy that minimizes the expected cumulated cost while maintaining fixed degrees of exploration at the states. In other words, maximize exploitation for constant exploration. This formulation leads to a set of nonlinear iterative equations reminiscent of the value-iteration algorithm and shows that the Boltzmann strategy based on the Q-value is optimal in this sense. Convergence of these equations to a local minimum is proved for a stationary environment. Interestingly, in the deterministic case with no exploration, the equations reduce to the Bellman equations for finding the shortest path. Furthermore, if the graph of states is directed and acyclic, the nonlinear equations can easily be solved by a single backward pass from the destination state. Stochastic shortest-path problems and discounted problems are also studied, and links between our algorithm and the SARSA algorithm are examined. The theoretical results are confirmed by simple simulations showing that the proposed exploration strategy outperforms the ε-greedy strategy.
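
As a rough illustration of the idea described in the abstract (not the authors' implementation), the sketch below builds a Boltzmann (softmax) distribution over the Q-values of a state's admissible actions and finds, by bisection, the temperature whose distribution has a prescribed entropy, i.e. a fixed degree of exploration for that state. The function names, the sign convention (Q-values treated as costs-to-go, so lower values receive higher probability) and the bisection bounds are illustrative assumptions.

    import numpy as np

    def boltzmann_policy(q_values, theta):
        """Boltzmann (softmax) distribution over a state's admissible actions,
        built from their Q-values; theta is the temperature."""
        # Q-values are taken as costs-to-go, so lower cost -> higher probability
        # (sign convention assumed). Max-subtraction gives numerical stability.
        z = -np.asarray(q_values, dtype=float) / theta
        z -= z.max()
        p = np.exp(z)
        return p / p.sum()

    def entropy(p):
        """Shannon entropy (in nats) of a probability distribution."""
        p = p[p > 0]
        return float(-(p * np.log(p)).sum())

    def temperature_for_entropy(q_values, target_entropy, lo=1e-3, hi=1e3, iters=60):
        """Bisection for the temperature whose Boltzmann distribution over the
        given Q-values has (approximately) the requested entropy. Entropy grows
        monotonically with temperature, so bisection converges."""
        for _ in range(iters):
            mid = 0.5 * (lo + hi)
            if entropy(boltzmann_policy(q_values, mid)) < target_entropy:
                lo = mid   # too greedy -> raise temperature
            else:
                hi = mid   # too random -> lower temperature
        return 0.5 * (lo + hi)

    # Example: three admissible actions in a state, with hypothetical Q-values.
    q = [2.0, 3.0, 5.0]
    theta = temperature_for_entropy(q, target_entropy=0.8)
    print(boltzmann_policy(q, theta))  # exploration fixed by the entropy constraint

In the paper's framework this entropy constraint is imposed per state, and the Q-values themselves are updated by value-iteration-like equations; the sketch only shows how a fixed degree of exploration pins down the Boltzmann temperature at one state.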