When are the value iteration maximizers close to an optimal stationary policy of a discounted Markov decision process?: closing the gap between the Borel space theory and actual computations

  • Authors:
  • Raúl Montes-De-Oca; Enrique Lemus-Rodríguez

  • Affiliations:
  • Departamento de Matemáticas, Universidad Autónoma Metropolitana-Iztapalapa, México D.F., México; Escuela de Actuaría, Universidad Anáhuac México-Norte, Edo. de México, México

  • Venue:
  • WSEAS Transactions on Mathematics
  • Year:
  • 2010

Abstract

Markov Decision Processes (MDPs) have been used repeatedly in economics and engineering, yet they apparently remain far from achieving their full potential because of the computational difficulties inherent to the subject, chiefly the usual impossibility of finding explicit optimal solutions. Value iteration is an elegant, theoretical method for approximating an optimal solution, frequently mentioned in economics when MDPs are used. To extend its use and benefits, an improved understanding of its convergence is still needed, even if it would appear otherwise. For instance, the corresponding convergence properties of the policies are still not well understood. In this paper we analyze this issue further: if value iteration yields at the N-th iteration a stationary policy fN whose total discounted reward is close to that of an optimal policy f*, are the corresponding actions f*(x) and fN(x) necessarily close for each state x? To our knowledge this question is still largely open. This paper studies when it is possible to stop the value iteration algorithm so that the corresponding maximizer stationary policy fN approximates an optimal policy both in total discounted reward and in the action space (uniformly over the state space). The action space is assumed to be a compact set and the reward function bounded. An ergodicity condition on the transition probability law and a structural condition on the reward function are needed. Under these conditions, an upper bound is obtained on the number of value iteration steps after which the corresponding maximizer is a uniform approximation of the optimal policy.
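The paper is set in a Borel state space with compact action sets. For orientation only, the sketch below is a minimal, illustrative value iteration on a finite MDP with the classical value-based stopping rule (stop when successive iterates differ by less than eps*(1-beta)/(2*beta) in sup-norm, which makes the greedy policy eps-optimal in discounted reward). This standard criterion guarantees closeness of rewards only, not the uniform closeness of actions studied in the paper; the function name, array layout, and numerical data are assumptions introduced here for illustration and are not taken from the article.

```python
import numpy as np

def value_iteration(P, r, beta, eps):
    """Illustrative value iteration for a finite discounted MDP.

    P[a, x, y] : transition probability from state x to state y under action a
    r[a, x]    : bounded one-step reward for action a in state x
    beta       : discount factor in (0, 1)
    eps        : desired accuracy of the discounted reward of the returned policy
    """
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    # Classical stopping rule: if ||V_{n+1} - V_n|| < eps*(1-beta)/(2*beta),
    # the greedy (maximizer) policy f_N is eps-optimal in discounted reward.
    tol = eps * (1.0 - beta) / (2.0 * beta)
    while True:
        Q = r + beta * P @ V          # Q[a, x] = r(a, x) + beta * E[V | x, a]
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            f_N = Q.argmax(axis=0)    # maximizer stationary policy at step N
            return V_new, f_N
        V = V_new

# Purely illustrative 2-state, 2-action example (invented numbers)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.6, 0.4]]])
r = np.array([[1.0, 0.0],
              [0.3, 2.0]])
V, f_N = value_iteration(P, r, beta=0.9, eps=1e-3)
```

The question raised in the abstract is precisely what this rule does not settle: even when V is within eps of the optimal value, the returned actions f_N(x) may in principle differ from the optimal f*(x); the paper's ergodicity and structural conditions are what make a uniform action-space approximation possible.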