This paper considers the problem of computing an optimal policy for a Markov decision process in the absence of complete a priori knowledge of (1) the branching probability distributions that determine the evolution of the process state upon the execution of the different actions, and (2) the probability distributions that characterize the immediate rewards returned by the environment when these actions are executed at the various states of the process. In addition, it is assumed that the underlying process evolves in a repetitive, episodic manner, with each episode starting from a well-defined initial state and evolving over an acyclic state space. A novel, efficient algorithm for this problem is proposed, and its convergence properties and computational complexity are rigorously characterized in the formal framework of computational learning theory. Furthermore, in the process of deriving these results, the presented work generalizes Bechhofer's "indifference-zone" approach to the ranking & selection problem of statistical inference theory so that it applies to populations with bounded general distributions.
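The ranking & selection primitive underlying such an algorithm can be illustrated with a simple sketch. The Python snippet below is a minimal, naive version of indifference-zone selection for populations with bounded general distributions, setting the per-action sample size via Hoeffding's inequality and a union bound; the function names, the uniform-sampling scheme, and the sample-size formula are illustrative assumptions, not the paper's actual procedure.

import math
import random

def indifference_zone_select(sample, k, eps, delta, support=1.0):
    """Return an action whose true mean reward is within eps of the best,
    with probability at least 1 - delta, for rewards bounded in [0, support].

    Naive uniform-sampling sketch (not the paper's procedure): if every
    empirical mean lands within eps/2 of its true mean (Hoeffding plus a
    union bound over the k actions), the empirically best action is
    eps-optimal.
    """
    # Per-action sample size from Hoeffding's inequality and a union bound.
    n = math.ceil(2.0 * (support / eps) ** 2 * math.log(2.0 * k / delta))
    # Draw n independent samples of each action and compare empirical means.
    means = [sum(sample(a) for _ in range(n)) / n for a in range(k)]
    return max(range(k), key=lambda a: means[a])

# Example: three actions with Bernoulli rewards; action 2 has the highest
# mean and is selected with probability at least 1 - delta.
probs = [0.3, 0.5, 0.6]
best = indifference_zone_select(lambda a: float(random.random() < probs[a]),
                                k=3, eps=0.05, delta=0.05)
print(best)

Under these assumptions the selected action is eps-optimal with probability at least 1 - delta, and the per-action sample size grows only logarithmically in the number of candidate actions; the paper's refinement of Bechhofer's approach targets exactly this kind of guarantee for bounded general distributions.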