Sphere packing numbers for subsets of the Boolean n-cube with bounded Vapnik-Chervonenkis dimension
Journal of Combinatorial Theory Series A
Feature-based methods for large scale dynamic programming
Machine Learning - Special issue on reinforcement learning
Nonparametric Time Series Prediction Through Adaptive ModelSelection
Machine Learning
Introduction to Reinforcement Learning
Introduction to Reinforcement Learning
Learning in Neural Networks: Theoretical Foundations
Learning in Neural Networks: Theoretical Foundations
Neuro-Dynamic Programming
Stochastic Optimal Control: The Discrete-Time Case
Stochastic Optimal Control: The Discrete-Time Case
Least-squares policy iteration
The Journal of Machine Learning Research
Tree-Based Batch Mode Reinforcement Learning
The Journal of Machine Learning Research
Finite time bounds for sampling based fitted value iteration
ICML '05 Proceedings of the 22nd international conference on Machine learning
Max-norm projections for factored MDPs
IJCAI'01 Proceedings of the 17th international joint conference on Artificial intelligence - Volume 1
Some studies in machine learning using the game of checkers
IBM Journal of Research and Development
Finite-Time Bounds for Fitted Value Iteration
The Journal of Machine Learning Research
Letters: On the bias of batch Bellman residual minimisation
Neurocomputing
ICANN'07 Proceedings of the 17th international conference on Artificial neural networks
Recursive least-squares learning with eligibility traces
EWRL'11 Proceedings of the 9th European conference on Recent Advances in Reinforcement Learning
Hi-index | 0.00 |
We consider batch reinforcement learning problems in continuous space, expected total discounted-reward Markovian Decision Problems. As opposed to previous theoretical work, we consider the case when the training data consists of a single sample path (trajectory) of some behaviour policy. In particular, we do not assume access to a generative model of the environment. The algorithm studied is policy iteration where in successive iterations the Q-functions of the intermediate policies are obtained by means of minimizing a novel Bellman-residual type error. PAC-style polynomial bounds are derived on the number of samples needed to guarantee near-optimal performance where the bound depends on the mixing rate of the trajectory, the smoothness properties of the underlying Markovian Decision Problem, the approximation power and capacity of the function set used.