Efficient reinforcement learning
COLT '94 Proceedings of the seventh annual conference on Computational learning theory
Inspired by recent results on polynomial-time reinforcement learning algorithms that accumulate near-optimal rewards, we study the related problem of quickly learning near-optimal policies. The new problem is clearly related to the previous one, but differs in important ways. We provide simple algorithms for MDPs and for zero-sum and common-payoff stochastic games, together with a uniform framework for proving their polynomial complexity. Unlike the previously studied problem, these bounds use the minimum of the mixing time and a new quantity, the spectral radius. Moreover, our results apply uniformly to both the average-reward and discounted cases.
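As an illustrative sketch only (the paper's "spectral radius" quantity may be defined differently), the standard connection between spectral properties and mixing can be seen on a small ergodic Markov chain: a row-stochastic transition matrix always has largest eigenvalue 1, and the second-largest eigenvalue modulus controls how quickly the chain forgets its starting state. The matrix `P` below is a made-up two-state example, not taken from the paper.

```python
import numpy as np

# Transition matrix of a small ergodic Markov chain (rows sum to 1).
# This chain is a hypothetical example for illustration.
P = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
])

# For a stochastic matrix the largest eigenvalue modulus is always 1;
# the second-largest modulus governs the geometric rate at which the
# chain converges to its stationary distribution.
moduli = np.sort(np.abs(np.linalg.eigvals(P)))[::-1]
second = moduli[1]  # 0.7 for this chain

# Rough mixing-time proxy: steps until the influence of the starting
# state decays below a tolerance eps, since the error shrinks like
# second**t.
eps = 1e-3
mixing_proxy = int(np.ceil(np.log(1 / eps) / np.log(1 / second)))
print(second, mixing_proxy)
```

A smaller second eigenvalue modulus (a larger spectral gap) gives a shorter mixing time, which is why bounds stated in terms of a spectral quantity can beat bounds stated directly in terms of the mixing time for rapidly mixing chains.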