Strategy Iteration Is Strongly Polynomial for 2-Player Turn-Based Stochastic Games with a Constant Discount Factor

  • Authors:
Thomas Dueholm Hansen; Peter Bro Miltersen; Uri Zwick

  • Affiliations:
Aarhus University; Aarhus University; Tel Aviv University

  • Venue:
  • Journal of the ACM (JACM)
  • Year:
  • 2013

Abstract

Ye [2011] showed recently that the simplex method with Dantzig’s pivoting rule, as well as Howard’s policy iteration algorithm, solve discounted Markov decision processes (MDPs) with a constant discount factor in strongly polynomial time. More precisely, Ye showed that both algorithms terminate after at most O((mn/(1−γ)) log(n/(1−γ))) iterations, where n is the number of states, m is the total number of actions in the MDP, and 0 < γ < 1 is the discount factor. We improve Ye’s analysis in two respects. First, we improve the bound given by Ye and show that Howard’s policy iteration algorithm actually terminates after at most O((m/(1−γ)) log(n/(1−γ))) iterations. Second, and more importantly, we show that the same bound applies to the number of iterations performed by the strategy iteration (or strategy improvement) algorithm, a generalization of Howard’s policy iteration algorithm used for solving 2-player turn-based stochastic games with discounted zero-sum rewards. This provides the first strongly polynomial algorithm for solving these games, resolving a long-standing open problem. Combined with other recent results, this provides a complete characterization of the complexity of the standard strategy iteration algorithm for 2-player turn-based stochastic games: it is strongly polynomial for a fixed discount factor, and exponential otherwise.
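
To make the algorithm whose iteration count is bounded here concrete, below is a minimal sketch of Howard’s policy iteration for a discounted MDP. It is an illustration under assumed conventions (a tabular MDP encoded as NumPy arrays P and R, the helper name policy_iteration), not code from the paper; the strategy iteration algorithm analyzed in the paper generalizes this loop to two players by alternating improvement steps for one player with best-response computations for the other.

```python
import numpy as np

def policy_iteration(P, R, gamma):
    """Howard's policy iteration on a tabular discounted MDP.
    P: (n, a, n) transition probabilities; R: (n, a) one-step rewards;
    gamma: discount factor with 0 < gamma < 1. Returns (policy, values)."""
    n, a, _ = P.shape
    policy = np.zeros(n, dtype=int)            # arbitrary initial policy
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) v = r_pi exactly.
        P_pi = P[np.arange(n), policy]         # (n, n) transitions under pi
        r_pi = R[np.arange(n), policy]         # (n,) rewards under pi
        v = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
        # Howard's improvement rule: switch to a strictly better action
        # in every improvable state simultaneously.
        q = R + gamma * (P @ v)                # (n, a) action values
        improvable = q.max(axis=1) > q[np.arange(n), policy] + 1e-10
        if not improvable.any():
            return policy, v                   # no improving switch: optimal
        policy = np.where(improvable, q.argmax(axis=1), policy)

# Tiny usage example on a random 5-state, 3-action MDP (hypothetical data).
rng = np.random.default_rng(0)
n, a = 5, 3
P = rng.random((n, a, n))
P /= P.sum(axis=2, keepdims=True)              # normalize rows to distributions
R = rng.random((n, a))
pi, v = policy_iteration(P, R, gamma=0.9)
print(pi, v)
```

Each pass of the loop is one "iteration" in the sense of the stated bounds: its cost is polynomial (a linear solve plus a greedy scan), so bounding the number of iterations by a polynomial in n and m alone, for fixed γ, is what makes the overall method strongly polynomial.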