Communications of the ACM
The complexity of Markov decision processes
Mathematics of Operations Research
Quantifying inductive bias: AI learning algorithms and Valiant's learning framework
Artificial Intelligence
Machine Learning - Special issue on genetic algorithms
Learning to Perceive and Act by Trial and Error
Machine Learning
The Convergence of TD(λ) for General λ
Machine Learning
An approach to anytime learning
ML92 Proceedings of the ninth international workshop on Machine learning
Temporal difference learning of backgammon strategy
ML92 Proceedings of the ninth international workshop on Machine learning
Learning to Predict by the Methods of Temporal Differences
Machine Learning
Inductive Inference, DFAs, and Computational Complexity
AII '89 Proceedings of the International Workshop on Analogical and Inductive Inference
Markov decision processes in large state spaces
COLT '95 Proceedings of the eighth annual conference on Computational learning theory
Learning curve bounds for a Markov decision process with undiscounted rewards
COLT '96 Proceedings of the ninth annual conference on Computational learning theory
A competitive approach to game learning
COLT '96 Proceedings of the ninth annual conference on Computational learning theory
PAC adaptive control of linear systems
COLT '97 Proceedings of the tenth annual conference on Computational learning theory
Machine Learning
Near-Optimal Reinforcement Learning in Polynomial Time
Machine Learning
Polynomial-time reinforcement learning of near-optimal policies
Eighteenth national conference on Artificial intelligence
Efficient learning of multi-step best response
Proceedings of the fourth international joint conference on Autonomous agents and multiagent systems
PAC model-free reinforcement learning
ICML '06 Proceedings of the 23rd international conference on Machine learning
Efficient PAC Learning for Episodic Tasks with Acyclic State Spaces
Discrete Event Dynamic Systems
Efficient reinforcement learning with relocatable action models
AAAI'07 Proceedings of the 22nd national conference on Artificial intelligence - Volume 1
Reinforcement learning: a survey
Journal of Artificial Intelligence Research
Customized learning algorithms for episodic tasks with acyclic state spaces
CASE'09 Proceedings of the fifth annual IEEE international conference on Automation science and engineering
Reinforcement Learning in Finite MDPs: PAC Analysis
The Journal of Machine Learning Research
Near-optimal Regret Bounds for Reinforcement Learning
The Journal of Machine Learning Research
Reducing reinforcement learning to KWIK online regression
Annals of Mathematics and Artificial Intelligence
AAAI'96 Proceedings of the thirteenth national conference on Artificial intelligence - Volume 1
Multiagent reinforcement learning algorithm using temporal difference error
ISNN'05 Proceedings of the Second international conference on Advances in Neural Networks - Volume Part I
On the efficient implementation of biologic reinforcement learning using eligibility traces
ISNN'06 Proceedings of the Third international conference on Advances in Neural Networks - Volume Part I
A reinforcement learning algorithm using temporal difference error in ant model
IWANN'05 Proceedings of the 8th international conference on Artificial Neural Networks: Computational Intelligence and Bioinspired Systems
A cooperation online reinforcement learning approach in Ant-Q
ICONIP'06 Proceedings of the 13th international conference on Neural Information Processing - Volume Part I
Efficient ant reinforcement learning using replacing eligibility traces
ICAISC'06 Proceedings of the 8th international conference on Artificial Intelligence and Soft Computing
Book reviews: Self-learning control of finite Markov chains
Automatica (Journal of IFAC)
In this paper we propose a new formal model for studying reinforcement learning, based on Valiant's PAC framework.

In our model the learner does not have direct access to every state of the environment. Instead, every sequence of experiments starts in a fixed initial state, and the learner is provided with a "reset" operation that interrupts the current sequence of experiments and starts a new one from the initial state.

We do not require the agent to learn the optimal policy, but only a good approximation of it with high probability. More precisely, we require the learner to produce a policy whose expected value from the initial state is ε-close to that of the optimal policy, with probability at least 1 − δ.

For this model, we describe an algorithm that produces such an (ε, δ)-optimal policy for any environment, in time polynomial in N, K, 1/ε, 1/δ, 1/(1 − β), and r_max, where N is the number of states of the environment, K is the maximum number of actions available in a state, β is the discount factor, and r_max is the maximum reward on any transition.
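To make the interaction model concrete, the following is a minimal Python sketch of the protocol the abstract describes. Everything here is an illustrative assumption rather than the paper's own code: the names (ResetMDP, reset, step), the tabular representation, and the truncation note at the end. Restated in symbols, a policy π̂ is (ε, δ)-optimal when Pr[ V(π̂, s0) ≥ V(π*, s0) − ε ] ≥ 1 − δ, where V(π, s0) is the expected discounted return of π from the initial state s0.

import random

class ResetMDP:
    # Hypothetical interface for the learning model in the abstract.
    # The learner cannot jump to arbitrary states: it can only act from
    # its current state (step) or restart from the fixed initial state
    # (reset). The tabular representation below is an assumption made
    # for illustration, not part of the paper.

    def __init__(self, transitions, rewards, s0):
        self.transitions = transitions  # transitions[s][a] = list of (prob, next_state)
        self.rewards = rewards          # rewards[s][a] = immediate reward in [0, r_max]
        self.s0 = s0                    # fixed initial state
        self.state = s0

    def reset(self):
        # The "reset" operation: interrupt the current sequence of
        # experiments and start a new one from the initial state.
        self.state = self.s0
        return self.state

    def step(self, action):
        # One experiment: execute `action` in the current state and
        # observe the reward and the stochastically drawn next state.
        reward = self.rewards[self.state][action]
        outcomes = self.transitions[self.state][action]
        probs = [p for p, _ in outcomes]
        succs = [s for _, s in outcomes]
        self.state = random.choices(succs, weights=probs)[0]
        return self.state, reward

A learner in this model repeatedly calls reset() and runs experiments for a finite horizon H. Because rewards are discounted by β < 1, truncating each run after H = O(log(r_max / (ε(1 − β))) / (1 − β)) steps perturbs value estimates from s0 by at most ε; this standard truncation argument is supplied here for context and is not a claim taken from the abstract itself.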