Reinforcement learning is the process by which an autonomous agent uses its experience interacting with an environment to improve its behavior. The Markov decision process (MDP) model is a popular way of formalizing the reinforcement-learning problem, but it is by no means the only way. In this paper, we show how many of the important theoretical results concerning reinforcement learning in MDPs extend to a generalized MDP model that includes MDPs, two-player games, and MDPs under a worst-case optimality criterion as special cases. The basis of this extension is a stochastic-approximation theorem that reduces asynchronous convergence to synchronous convergence.

Keywords: Reinforcement learning, Q-learning convergence, Markov games
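The unifying idea described above — one Bellman-style update in which only the operator that summarizes next-state values changes across MDPs, games, and the worst-case criterion — can be illustrated with a small sketch. This is not code from the paper; the function names and the tabular setup are assumptions chosen for illustration. The `summarize` argument plays the role of the generalized model's pluggable operator: the expectation under the transition distribution recovers the ordinary MDP, while a minimum over reachable next states gives the worst-case criterion.

```python
import numpy as np

def value_iteration(R, P, gamma, summarize, iters=200):
    """Generalized value iteration (illustrative sketch).

    R[s, a]     -- immediate reward for taking action a in state s
    P[s, a, s'] -- transition probability to s'
    summarize(P_sa, V) -> scalar summary of next-state values;
    swapping this operator switches between the special cases
    (expectation = MDP, min over support = worst-case MDP).
    """
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    for _ in range(iters):
        Q = np.array([[R[s, a] + gamma * summarize(P[s, a], V)
                       for a in range(n_actions)]
                      for s in range(n_states)])
        V = Q.max(axis=1)  # greedy choice over the agent's actions
    return V

# Expected-value summary: the standard MDP Bellman operator.
expectation = lambda p, V: p @ V
# Worst-case summary: minimum value over next states with nonzero probability.
worst_case = lambda p, V: V[p > 0].min()
```

Because the worst-case summary never exceeds the expectation for the same value function, worst-case values are a pointwise lower bound on the standard MDP values; running both operators on the same `(R, P)` makes that ordering visible.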