Efficient sampling strategies for relational database operations
ICDT Selected papers of the 4th international conference on Database theory
An introduction to computational learning theory
Query size estimation by adaptive sampling
Selected papers of the 9th annual ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems
Introduction to Reinforcement Learning
Near-Optimal Reinforcement Learning in Polynomial Time
ICML '98 Proceedings of the Fifteenth International Conference on Machine Learning
From Computational Learning Theory to Discovery Science
ICALP '99 Proceedings of the 26th International Colloquium on Automata, Languages and Programming
Practical Algorithms for On-line Sampling
DS '98 Proceedings of the First International Conference on Discovery Science
Reinforcement learning: a survey
Journal of Artificial Intelligence Research
Efficient reinforcement learning in factored MDPs
IJCAI'99 Proceedings of the 16th international joint conference on Artificial intelligence - Volume 2
Sequential Sampling Techniques for Algorithmic Learning Theory
ALT '00 Proceedings of the 11th International Conference on Algorithmic Learning Theory
Recently, Kearns and Singh presented the first provably efficient and near-optimal algorithm for reinforcement learning in general Markov decision processes. A key contribution of their algorithm is its explicit treatment of the exploration-exploitation trade-off. In this paper, we show how the algorithm can be improved by replacing its exploration phase, which builds a model of the underlying Markov decision process by estimating the transition probabilities, with an adaptive sampling method better suited to the problem. Our improvement is twofold. First, our theoretical bound on the worst-case time needed to converge to an almost optimal policy is significantly smaller. Second, because the sampling method we use is adaptive, we discuss how our algorithm might perform better in practice than the previous one.
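The core idea of the abstract, estimating a transition probability with a sample size that adapts to the unknown probability rather than a fixed worst-case count, can be illustrated with a small sketch. The stopping rule below (draw until a target number of successes is reached) is in the spirit of Lipton-Naughton-style adaptive sampling; the function name, constants, and threshold are illustrative assumptions, not the paper's actual algorithm.

```python
import math
import random

def adaptive_estimate(sample, eps=0.1, delta=0.05):
    """Adaptively estimate a Bernoulli success probability p.

    Instead of fixing the number of draws in advance from a worst-case
    bound, keep sampling until a target number of successes is observed,
    so the total number of draws scales roughly like target/p and adapts
    to the unknown p.  (Illustrative constants; not the paper's bound.)
    """
    # Success-count target giving relative error eps with prob. >= 1 - delta
    target = math.ceil(3 * math.log(2 / delta) / eps**2)
    successes, draws = 0, 0
    while successes < target:
        draws += 1
        if sample():
            successes += 1
    return successes / draws

# Usage: estimate the probability of a transition event with p = 0.3.
random.seed(0)
p_hat = adaptive_estimate(lambda: random.random() < 0.3, eps=0.2)
```

Note how a likely transition is resolved quickly while a rare one automatically receives more draws, which is the practical advantage the abstract alludes to over a fixed-size exploration phase.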