Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path

Authors:
András Antos; Csaba Szepesvári; Rémi Munos
Affiliations:
Computer and Automation Research Inst. of the Hungarian Academy of Sciences, Budapest, Hungary; Computer and Automation Research Inst. of the Hungarian Academy of Sciences, Budapest, Hungary; Centre de Mathématiques Appliquées, Ecole Polytechnique, Palaiseau, France
Venue:
COLT'06: Proceedings of the 19th Annual Conference on Learning Theory
Year:
2006

References: 12
Cited by: 5

Abstract

We consider batch reinforcement learning problems in continuous-space, expected total discounted-reward Markovian Decision Problems. As opposed to previous theoretical work, we consider the case where the training data consists of a single sample path (trajectory) of some behaviour policy. In particular, we do not assume access to a generative model of the environment. The algorithm studied is policy iteration in which, in successive iterations, the Q-functions of the intermediate policies are obtained by minimizing a novel Bellman-residual type error. PAC-style polynomial bounds are derived on the number of samples needed to guarantee near-optimal performance; the bounds depend on the mixing rate of the trajectory, the smoothness properties of the underlying Markovian Decision Problem, and the approximation power and capacity of the function set used.
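
The policy-iteration scheme summarized in the abstract can be illustrated with a rough sketch. This is not the authors' algorithm: it minimizes the plain squared Bellman residual rather than the novel Bellman-residual type error the paper introduces. The deterministic three-state chain, the one-hot features, the uniformly random behaviour policy, the small ridge term, and all function names are hypothetical choices made only for illustration.

```python
import random

GAMMA = 0.9
N_STATES, N_ACTIONS = 3, 2  # hypothetical toy chain: action 0 = left, 1 = right


def step(x, a):
    """Deterministic toy dynamics: reward 1 for landing in the last state."""
    x_next = min(x + 1, N_STATES - 1) if a == 1 else max(x - 1, 0)
    return (1.0 if x_next == N_STATES - 1 else 0.0), x_next


def phi(x, a):
    """One-hot features over (state, action) pairs (a linear function set)."""
    v = [0.0] * (N_STATES * N_ACTIONS)
    v[x * N_ACTIONS + a] = 1.0
    return v


def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    w = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][k] * w[k] for k in range(r + 1, n))
        w[r] = (M[r][n] - s) / M[r][r]
    return w


def evaluate(path, policy, ridge=1e-8):
    """Least-squares fit of w to the squared Bellman residual of `policy`.

    Residual at step t: Q_w(x_t, a_t) - r_t - GAMMA * Q_w(x_{t+1}, policy[x_{t+1}]).
    """
    n = N_STATES * N_ACTIONS
    # Tiny ridge term (an assumption) keeps the normal equations well posed.
    A = [[ridge if i == j else 0.0 for j in range(n)] for i in range(n)]
    b = [0.0] * n
    for x, a, r, x_next in path:
        f, g = phi(x, a), phi(x_next, policy[x_next])
        d = [fi - GAMMA * gi for fi, gi in zip(f, g)]  # residual = w.d - r
        for i in range(n):
            b[i] += d[i] * r
            for j in range(n):
                A[i][j] += d[i] * d[j]
    return solve(A, b)


def fitted_policy_iteration(path, n_iters=10):
    """Alternate residual-minimizing evaluation and greedy improvement."""
    policy = [0] * N_STATES
    for _ in range(n_iters):
        w = evaluate(path, policy)
        policy = [max(range(N_ACTIONS), key=lambda a: w[x * N_ACTIONS + a])
                  for x in range(N_STATES)]
    return policy


def sample_path(n_steps=500, seed=0):
    """One trajectory of a uniformly random behaviour policy (no generative model)."""
    rng = random.Random(seed)
    x, path = 0, []
    for _ in range(n_steps):
        a = rng.randrange(N_ACTIONS)
        r, x_next = step(x, a)
        path.append((x, a, r, x_next))
        x = x_next
    return path


policy = fitted_policy_iteration(sample_path())
```

Because the toy dynamics are deterministic, the plain squared residual is an unbiased objective here; with stochastic transitions it generally is not, which is one reason a modified loss of the kind the abstract mentions may be needed.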