In this paper we consider the problem of finding a near-optimal policy in a continuous-space, discounted Markovian Decision Problem (MDP) by value-function-based methods, when only a single trajectory of a fixed policy is available as input. We study a policy-iteration algorithm whose iterates are obtained via empirical risk minimization with a risk function that penalizes large Bellman residuals. Our main result is a finite-sample, high-probability bound on the performance of the computed policy that depends on the mixing rate of the trajectory, the capacity of the function set as measured by a novel capacity concept (the VC-crossing dimension), the approximation power of the function set, and the controllability properties of the MDP. Moreover, we prove that when a linear parameterization is used, the new algorithm is equivalent to Least-Squares Policy Iteration. To the best of our knowledge, this is the first theoretical result for off-policy control learning over continuous state spaces using a single trajectory.
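To make the policy-evaluation step concrete, here is a minimal sketch of Bellman-residual minimization with a linear parameterization, where the action-value function is modeled as Q(s, a) ≈ φ(s, a)ᵀw. All names (`phi`, `trajectory`, the ridge term `reg`) are illustrative assumptions, and the plain squared-residual loss below is the naive variant; the paper's actual risk function is a modified one that corrects the bias this loss incurs on stochastic transitions.

```python
import numpy as np

def bellman_residual_policy_eval(phi, trajectory, gamma=0.95, reg=1e-6):
    """Illustrative least-squares Bellman-residual minimization for
    policy evaluation with a linear model Q(s, a) = phi(s, a) @ w.

    `trajectory` is an iterable of (s, a, r, s2, a2) transitions, where
    a2 is the action the evaluated policy takes in s2; `phi` maps a
    (state, action) pair to a feature vector.  This is a sketch of the
    naive loss, not the paper's exact (debiased) risk function.
    """
    data = list(trajectory)
    # Feature matrices for current and successor state-action pairs.
    Phi  = np.array([phi(s, a)   for (s, a, r, s2, a2) in data])
    Phi2 = np.array([phi(s2, a2) for (s, a, r, s2, a2) in data])
    R    = np.array([r           for (s, a, r, s2, a2) in data])

    # Minimize the empirical squared Bellman residual
    #   sum_t (phi(s_t, a_t) @ w - r_t - gamma * phi(s'_t, a'_t) @ w)^2,
    # an ordinary (ridge-regularized) least-squares problem in w.
    D = Phi - gamma * Phi2
    A = D.T @ D + reg * np.eye(D.shape[1])
    b = D.T @ R
    return np.linalg.solve(A, b)
```

With linear features, solving a least-squares problem of this kind inside each policy-iteration step is what connects the procedure to Least-Squares Policy Iteration, as stated in the abstract.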