On policy iteration as a Newton's method and polynomial policy iteration algorithms

  • Authors:
  • Omid Madani

  • Affiliations:
  • Department of Computing Science, University of Alberta, Edmonton, AB, Canada

  • Venue:
  • Eighteenth National Conference on Artificial Intelligence
  • Year:
  • 2002

Abstract

Policy iteration is a popular technique for solving Markov decision processes (MDPs). It is easy to describe and implement, and it performs excellently in practice. But not much is known about its complexity: the best upper bound remains exponential, and the best lower bound is a trivial Ω(n) on the number of iterations, where n is the number of states. This paper improves the upper bound to a polynomial for policy iteration on MDP problems with special graph structure. Our analysis is based on the connection between policy iteration and Newton's method for finding the zero of a convex function. The analysis offers an explanation of why policy iteration is fast. It also leads to polynomial bounds on several variants of policy iteration for MDPs whose linear programming formulation requires at most two variables per inequality (MDP(2)). The MDP(2) class includes deterministic MDPs under both discounted and average-reward criteria. The resulting running-time bounds include O(mn² log m log W) for MDP(2) and O(mn² log m) for deterministic MDPs, where m denotes the number of actions and W denotes the magnitude of the largest number in the problem description.
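
For orientation, the sketch below shows textbook policy iteration for a discounted MDP: alternate exact policy evaluation (a linear solve) with greedy policy improvement until the policy stops changing. This is only an illustration of the standard algorithm the abstract refers to, not the paper's MDP(2) variants or its Newton's-method analysis; the array shapes, the discount factor gamma, and the function name are assumptions made for the example.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, max_iters=1000):
    """Textbook policy iteration for a discounted MDP (illustrative sketch).

    P: transition probabilities, shape (num_actions, num_states, num_states)
    R: expected immediate rewards, shape (num_actions, num_states)
    gamma: discount factor in [0, 1)  (assumed for this example)
    """
    num_actions, num_states, _ = P.shape
    policy = np.zeros(num_states, dtype=int)  # start from an arbitrary policy

    for _ in range(max_iters):
        # Policy evaluation: solve (I - gamma * P_pi) V = R_pi for the current policy.
        P_pi = P[policy, np.arange(num_states), :]   # (num_states, num_states)
        R_pi = R[policy, np.arange(num_states)]      # (num_states,)
        V = np.linalg.solve(np.eye(num_states) - gamma * P_pi, R_pi)

        # Policy improvement: act greedily with respect to the evaluated values.
        Q = R + gamma * np.einsum("asj,j->as", P, V)  # (num_actions, num_states)
        new_policy = Q.argmax(axis=0)

        if np.array_equal(new_policy, policy):        # no change: policy is optimal
            return policy, V
        policy = new_policy

    return policy, V
```

The open question the paper addresses is how many of these evaluation/improvement rounds the loop can take in the worst case; the known bounds quoted in the abstract concern exactly that iteration count.
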