On policy iteration as a Newton's method and polynomial policy iteration algorithms

  • Authors:
  • Omid Madani

  • Affiliations:
  • Department of Computing Science, University of Alberta, Edmonton, AB, Canada

  • Venue:
  • Eighteenth National Conference on Artificial Intelligence
  • Year:
  • 2002

Abstract

Policy iteration is a popular technique for solving Markov decision processes (MDPs). It is easy to describe and implement, and it performs excellently in practice. But not much is known about its complexity: the best upper bound remains exponential, and the best lower bound is a trivial Ω(n) on the number of iterations, where n is the number of states. This paper improves the upper bound to a polynomial for policy iteration on MDP problems with special graph structure. Our analysis is based on the connection between policy iteration and Newton's method for finding the zero of a convex function. The analysis offers an explanation of why policy iteration is fast. It also leads to polynomial bounds on several variants of policy iteration for MDPs whose linear programming formulation requires at most two variables per inequality (MDP(2)). The MDP(2) class includes deterministic MDPs under both discounted and average-reward criteria. The resulting running-time bounds include O(mn² log m log W) for MDP(2) and O(mn² log m) for deterministic MDPs, where m denotes the number of actions and W denotes the magnitude of the largest number in the problem description.
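
For orientation, the sketch below shows textbook policy iteration for a discounted MDP: alternate exact policy evaluation (a linear solve) with greedy policy improvement until the policy stops changing. This is only an illustration of the standard algorithm the abstract refers to, not the paper's MDP(2) variants or its Newton's-method analysis; the array shapes, the discount factor gamma, and the function name are assumptions made for the example.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, max_iters=1000):
    """Textbook policy iteration for a discounted MDP (illustrative sketch).

    P: transition probabilities, shape (num_actions, num_states, num_states)
    R: expected immediate rewards, shape (num_actions, num_states)
    gamma: discount factor in [0, 1)  (assumed for this example)
    """
    num_actions, num_states, _ = P.shape
    policy = np.zeros(num_states, dtype=int)  # start from an arbitrary policy

    for _ in range(max_iters):
        # Policy evaluation: solve (I - gamma * P_pi) V = R_pi for the current policy.
        P_pi = P[policy, np.arange(num_states), :]   # (num_states, num_states)
        R_pi = R[policy, np.arange(num_states)]      # (num_states,)
        V = np.linalg.solve(np.eye(num_states) - gamma * P_pi, R_pi)

        # Policy improvement: act greedily with respect to the evaluated values.
        Q = R + gamma * np.einsum("asj,j->as", P, V)  # (num_actions, num_states)
        new_policy = Q.argmax(axis=0)

        if np.array_equal(new_policy, policy):        # no change: policy is optimal
            return policy, V
        policy = new_policy

    return policy, V
```

The open question the paper addresses is how many of these evaluation/improvement rounds the loop can take in the worst case; the known bounds quoted in the abstract concern exactly that iteration count.
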