On the convergence of optimistic policy iteration

  • Authors: John N. Tsitsiklis
  • Affiliations: LIDS, Room 35-209, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, MA
  • Venue: The Journal of Machine Learning Research
  • Year: 2003

Abstract

We consider a finite-state Markov decision problem and establish the convergence of a special case of optimistic policy iteration that involves Monte Carlo estimation of Q-values, in conjunction with greedy policy selection. We provide convergence results for a number of algorithmic variations, including one that involves temporal difference learning (bootstrapping) instead of Monte Carlo estimation. We also indicate some extensions that either fail or are unlikely to go through.
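To make the abstract's setting concrete, here is a minimal sketch of the general idea: Monte Carlo rollouts estimate Q-values while the policy is re-selected greedily after every round of estimates, before those estimates have converged (the "optimistic" aspect). The toy MDP, its transition and reward tables, and the horizon/iteration counts are all illustrative assumptions, not the paper's construction.

```python
import random

# Hypothetical toy MDP (an assumption for illustration): states 0..2, actions 0..1.
# P[s][a] = list of (next_state, prob); R[s][a] = immediate reward. State 2 is absorbing.
P = {
    0: {0: [(1, 1.0)], 1: [(2, 1.0)]},
    1: {0: [(0, 0.5), (2, 0.5)], 1: [(2, 1.0)]},
    2: {0: [(2, 1.0)], 1: [(2, 1.0)]},
}
R = {
    0: {0: 0.0, 1: 1.0},
    1: {0: 0.0, 1: 2.0},
    2: {0: 0.0, 1: 0.0},
}
GAMMA = 0.9

def step(s, a, rng):
    """Sample one transition from the MDP."""
    x, acc = rng.random(), 0.0
    for s2, p in P[s][a]:
        acc += p
        if x <= acc:
            return s2, R[s][a]
    return P[s][a][-1][0], R[s][a]

def rollout_return(s, a, policy, rng, horizon=50):
    """One Monte Carlo sample of Q(s, a): take action a, then follow the policy."""
    s2, r = step(s, a, rng)
    total, disc = r, GAMMA
    for _ in range(horizon):
        s2, r = step(s2, policy[s2], rng)
        total += disc * r
        disc *= GAMMA
    return total

rng = random.Random(0)
Q = {s: {a: 0.0 for a in (0, 1)} for s in P}
counts = {s: {a: 0 for a in (0, 1)} for s in P}
policy = {s: 0 for s in P}

for _ in range(500):
    # Average one fresh rollout into the running Q estimate for every (s, a).
    for s in P:
        for a in (0, 1):
            g = rollout_return(s, a, policy, rng)
            counts[s][a] += 1
            Q[s][a] += (g - Q[s][a]) / counts[s][a]
    # Optimistic step: switch to the greedy policy immediately, without
    # waiting for the Q estimates of the current policy to converge.
    policy = {s: max((0, 1), key=lambda a: Q[s][a]) for s in P}

print(policy)
```

In this toy instance the iterates settle on taking action 0 in state 0 (deferring the immediate reward of 1 for the discounted reward of 2 available from state 1) and action 1 in state 1, even though the running averages mix returns sampled under earlier, non-greedy policies.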