Experience generalization for concurrent reinforcement learners: the minimax-QS algorithm. Proceedings of the First International Joint Conference on Autonomous Agents and Multiagent Systems: Part 3.
Artificial Intelligence Review.
ε-MDPs: learning in varying environments. The Journal of Machine Learning Research.
Application of Markov chains in an interactive information retrieval system. Information Processing and Management: An International Journal.
A Unified Analysis of Value-Function-Based Reinforcement Learning Algorithms. Neural Computation.
Heuristic Reinforcement Learning Applied to RoboCup Simulation Agents. RoboCup 2007: Robot Soccer World Cup XI.
Optimistic-Pessimistic Q-Learning Algorithm for Multi-Agent Systems. MATES '08: Proceedings of the 6th German Conference on Multiagent System Technologies.
Multi-Agent Reinforcement Learning Algorithm with Variable Optimistic-Pessimistic Criterion. ECAI 2008: Proceedings of the 18th European Conference on Artificial Intelligence.
Improving Reinforcement Learning by Using Case Based Heuristics. ICCBR '09: Proceedings of the 8th International Conference on Case-Based Reasoning: Case-Based Reasoning Research and Development.
Perseus: randomized point-based value iteration for POMDPs. Journal of Artificial Intelligence Research.
Heuristic selection of actions in multiagent reinforcement learning. IJCAI '07: Proceedings of the 20th International Joint Conference on Artificial Intelligence.
Relational reinforcement learning applied to shared attention. IJCNN '09: Proceedings of the 2009 International Joint Conference on Neural Networks.
Heuristic Q-learning soccer players: a new reinforcement learning approach to RoboCup simulation. EPIA '07: Proceedings of the 13th Portuguese Conference on Progress in Artificial Intelligence.
The problem of maximizing the expected total discounted reward in a completely observable Markovian environment, i.e., a Markov decision process (MDP), models a particular class of sequential decision problems. Algorithms have been developed for making optimal decisions in MDPs given either an MDP specification or the opportunity to interact with the MDP over time. Recently, other sequential decision-making problems have been studied, prompting the development of new algorithms and analyses. We describe a new generalized model that subsumes MDPs as well as many of the recent variations. We prove some basic results concerning this model and develop generalizations of value iteration, policy iteration, model-based reinforcement learning, and Q-learning that can be used to make optimal decisions in the generalized model under various assumptions. Applications of the theory to particular models are described, including risk-averse MDPs, exploration-sensitive MDPs, SARSA, Q-learning with spreading, two-player games, and approximate max picking via sampling. Central to the results are the contraction property of the value operator and a stochastic-approximation theorem that reduces asynchronous convergence to synchronous convergence.
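The generalized value iteration described above can be sketched in code. The update V ← ⊗_a (R + γ · E_{s'}[V]) uses a pluggable operator ⊗ that summarizes over actions: max recovers the standard MDP, min a pessimistic (e.g. risk-averse or adversarial) variant. Because the update remains a γ-contraction in the sup norm for any non-expansive summarizer, iteration converges to a unique fixed point. This is a minimal illustrative sketch, not the paper's implementation; all names (`P`, `R`, `summarize_actions`, etc.) are assumptions introduced here.

```python
import numpy as np

def generalized_value_iteration(P, R, gamma, summarize_actions,
                                tol=1e-8, max_iter=10_000):
    """Generalized value iteration (illustrative sketch).

    P: (A, S, S) transition probabilities; R: (A, S) immediate rewards.
    summarize_actions: maps per-action values (A, S) -> (S,); max gives the
    standard Bellman optimality update, other non-expansive summarizers give
    the generalized variants discussed in the abstract.
    """
    n_states = P.shape[1]
    V = np.zeros(n_states)
    for _ in range(max_iter):
        # Expectation over next states s' (the "summary over outcomes"):
        Q = R + gamma * np.einsum("ast,t->as", P, V)
        # Summary over actions (max for MDPs, min for a pessimistic variant):
        V_new = summarize_actions(Q)
        if np.max(np.abs(V_new - V)) < tol:  # sup-norm convergence test
            return V_new
        V = V_new
    return V

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    P = rng.dirichlet(np.ones(3), size=(2, 3))   # 2 actions, 3 states
    R = rng.standard_normal((2, 3))
    V_opt = generalized_value_iteration(P, R, 0.9, lambda Q: Q.max(axis=0))
    V_pes = generalized_value_iteration(P, R, 0.9, lambda Q: Q.min(axis=0))
    # By monotonicity of both operators, optimistic values dominate pessimistic ones:
    assert np.all(V_opt >= V_pes - 1e-6)
```

Swapping in other summarizers (e.g. a convex combination of max and min, or an opponent's minimization in a two-player game) changes which of the abstract's models the iteration solves, without touching the convergence argument.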