Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation.
Mathematics of Operations Research.
Scale-sensitive dimensions, uniform convergence, and learnability. Journal of the ACM.
Approximate Gradient Methods in Policy-Space Optimization of Markov Reward Processes. Discrete Event Dynamic Systems.
PEGASUS: A policy search method for large MDPs and POMDPs. Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence (UAI '00).
Bounds on Sample Size for Policy Evaluation in Markov Environments. Proceedings of the 14th Annual Conference on Computational Learning Theory and 5th European Conference on Computational Learning Theory (COLT '01/EuroCOLT '01).
Simulation-based Uniform Value Function Estimates of Markov Decision Processes. SIAM Journal on Control and Optimization.
On the Empirical State-Action Frequencies in Markov Decision Processes Under General Policies. Mathematics of Operations Research.
Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research.
Neural Network Learning: Theoretical Foundations.
Stochastic Learning and Optimization: A Sensitivity-Based Approach.
Learning and Generalization: With Applications to Neural Networks.
We generalize and build on the PAC learning framework for Markov decision processes developed in Jain and Varaiya (2006). We allow the reward function to depend on both the state and the action, and both the state and action spaces may be countably infinite. We estimate the value function of a Markov decision process, which assigns to each policy its expected discounted reward, by taking the empirical average of the reward over many independent simulation runs. We derive bounds on the number of runs needed for the empirical average to converge to the expected reward uniformly over a class of policies, in terms of the VC or pseudo-dimension of the policy class. We then propose a framework for obtaining an ε-optimal policy from simulation, and provide the sample complexity of this approach.
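As a rough illustration of the simulation-based estimate described in the abstract, here is a minimal Python sketch. It assumes a user-supplied single-step simulator env_step(state, action) -> (next_state, reward), policies given as callables state -> action, and truncation of each run at a finite horizon; these names and the truncation scheme are hypothetical stand-ins, not constructs from the paper.

    def estimate_value(policy, env_step, init_state, gamma, n_runs, horizon):
        # Empirical average of the truncated discounted reward over
        # n_runs independent simulation runs of the given policy.
        total = 0.0
        for _ in range(n_runs):
            state, discount, ret = init_state, 1.0, 0.0
            for _ in range(horizon):
                action = policy(state)
                state, reward = env_step(state, action)  # one simulated transition
                ret += discount * reward
                discount *= gamma
            total += ret
        return total / n_runs

    def select_policy(policies, env_step, init_state, gamma, n_runs, horizon):
        # Return the policy in the class with the largest empirical value estimate.
        return max(policies,
                   key=lambda p: estimate_value(p, env_step, init_state,
                                                gamma, n_runs, horizon))

Truncating at horizon T biases each run by at most gamma^T * R_max / (1 - gamma) when rewards are bounded by R_max. A uniform convergence bound over the policy class, of the kind the paper derives in terms of its VC or pseudo-dimension, then guarantees that with enough runs every estimate is within ε/2 of the true value, so the empirical maximizer returned by select_policy is ε-optimal within the class with high probability.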