In this paper we consider the problem of finding a near-optimal policy in a continuous-space, discounted Markovian Decision Problem (MDP) by value-function-based methods, when only a single trajectory of a fixed policy is available as input. We study a policy-iteration algorithm whose iterates are obtained via empirical risk minimization with a risk function that penalizes large Bellman residuals. Our main result is a finite-sample, high-probability bound on the performance of the computed policy that depends on the mixing rate of the trajectory, the capacity of the function set as measured by a novel capacity concept (the VC-crossing dimension), the approximation power of the function set, and the controllability properties of the MDP. Moreover, we prove that when a linear parameterization is used, the new algorithm is equivalent to Least-Squares Policy Iteration. To the best of our knowledge, this is the first theoretical result for off-policy control learning over continuous state spaces using a single trajectory.
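To make the policy-evaluation step concrete, here is a minimal sketch of Bellman-residual minimization with a linear parameterization, where the action-value function is modeled as Q(s, a) ≈ φ(s, a)ᵀw. All names (`phi`, `trajectory`, the ridge term `reg`) are illustrative assumptions, and the plain squared-residual loss below is the naive variant; the paper's actual risk function is a modified one that corrects the bias this loss incurs on stochastic transitions.

```python
import numpy as np

def bellman_residual_policy_eval(phi, trajectory, gamma=0.95, reg=1e-6):
    """Illustrative least-squares Bellman-residual minimization for
    policy evaluation with a linear model Q(s, a) = phi(s, a) @ w.

    `trajectory` is an iterable of (s, a, r, s2, a2) transitions, where
    a2 is the action the evaluated policy takes in s2; `phi` maps a
    (state, action) pair to a feature vector.  This is a sketch of the
    naive loss, not the paper's exact (debiased) risk function.
    """
    data = list(trajectory)
    # Feature matrices for current and successor state-action pairs.
    Phi  = np.array([phi(s, a)   for (s, a, r, s2, a2) in data])
    Phi2 = np.array([phi(s2, a2) for (s, a, r, s2, a2) in data])
    R    = np.array([r           for (s, a, r, s2, a2) in data])

    # Minimize the empirical squared Bellman residual
    #   sum_t (phi(s_t, a_t) @ w - r_t - gamma * phi(s'_t, a'_t) @ w)^2,
    # an ordinary (ridge-regularized) least-squares problem in w.
    D = Phi - gamma * Phi2
    A = D.T @ D + reg * np.eye(D.shape[1])
    b = D.T @ R
    return np.linalg.solve(A, b)
```

With linear features, solving a least-squares problem of this kind inside each policy-iteration step is what connects the procedure to Least-Squares Policy Iteration, as stated in the abstract.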