We consider planning in a Markovian decision problem, i.e., the problem of finding a good policy given access to a generative model of the environment. We propose fitted Q-iteration with penalized (regularized) least-squares regression as the regression subroutine, in order to control model complexity. The algorithm is presented in detail for the case when the function space is the reproducing-kernel Hilbert space induced by a user-chosen kernel function. We derive bounds on the quality of the resulting solution and argue that data-dependent penalties can lead to almost optimal performance. A simple example illustrates the benefits of the penalized procedure.
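To make the procedure concrete, below is a minimal sketch of fitted Q-iteration with penalized least-squares regression in an RKHS, using scikit-learn's KernelRidge (kernel ridge regression) as the regression subroutine. The generative-model interface step(s, a) -> (reward, next_state), the RBF kernel choice, and all parameter values are illustrative assumptions rather than specifics from the paper; in particular, the paper advocates data-dependent penalties, whereas this sketch fixes the penalty up front.

```python
# A minimal sketch of regularized fitted Q-iteration, assuming a finite
# action set and a generative model step(s, a) -> (reward, next_state).
# Names and parameter values are illustrative, not from the paper.
import numpy as np
from sklearn.kernel_ridge import KernelRidge  # penalized least squares in an RKHS

def regularized_fitted_q_iteration(states, actions, step, discount=0.95,
                                   n_iters=50, ridge=1e-2):
    """Fit one kernel ridge regressor per action to approximate Q(., a)."""
    # Draw one transition per (state, action) pair from the generative model.
    data = {a: [step(s, a) for s in states] for a in actions}
    X = np.asarray(states)          # shape (n_states, state_dim)
    models = {a: None for a in actions}  # Q_0 = 0: "no regressor yet"

    def q_values(S):
        # Evaluate the current Q-estimate at states S, one column per action.
        return np.column_stack([
            np.zeros(len(S)) if models[a] is None else models[a].predict(S)
            for a in actions
        ])

    for _ in range(n_iters):
        new_models = {}
        for a in actions:
            rewards = np.array([r for r, _ in data[a]])
            next_states = np.asarray([s2 for _, s2 in data[a]])
            # Bellman targets: r + discount * max_a' Q_k(s', a').
            targets = rewards + discount * q_values(next_states).max(axis=1)
            # `ridge` is the penalty that controls model complexity; the paper's
            # point is that it should be chosen in a data-dependent way.
            m = KernelRidge(kernel="rbf", gamma=0.5, alpha=ridge)
            m.fit(X, targets)
            new_models[a] = m
        models = new_models

    def greedy_policy(s):
        # Act greedily with respect to the final Q-estimate.
        return actions[int(np.argmax(q_values(np.atleast_2d(s))[0]))]

    return greedy_policy
```

For instance, on a one-dimensional toy problem one would pass a list of sampled state vectors, actions = [0, 1], and a step function simulating the dynamics; the returned greedy policy can then be rolled out to gauge how the penalty level trades off under- and over-fitting of the Q-function.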