Reinforcement learning for POMDPs based on action values and stochastic optimization

  • Authors:
  • Theodore J. Perkins

  • Affiliations:
  • Department of Computer Science, University of Massachusetts Amherst, 140 Governor's Drive, Amherst, MA

  • Venue:
  • Eighteenth National Conference on Artificial Intelligence (AAAI-02)
  • Year:
  • 2002

Abstract

We present a new, model-free reinforcement learning algorithm for learning to control partially-observable Markov decision processes. The algorithm incorporates ideas from action-value based reinforcement learning approaches, such as Q-Learning, as well as ideas from the stochastic optimization literature. Key to our approach is a new definition of action value, which makes the algorithm theoretically sound for partially-observable settings. We show that special cases of our algorithm can achieve probability one convergence to locally optimal policies in the limit, or probably approximately correct hill-climbing to a locally optimal policy in a finite number of samples.
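To make the abstract's high-level recipe concrete, the sketch below illustrates one generic way to combine Monte-Carlo action values defined over observations with stochastic hill-climbing in the space of memoryless stochastic policies. The toy POMDP, the function names, and the acceptance rule are illustrative assumptions only; they do not reproduce the paper's action-value definition or its convergence guarantees.

```python
# Illustrative sketch only: Monte-Carlo action values over observations plus
# stochastic hill-climbing over memoryless stochastic policies. The toy POMDP,
# names, and update rule are hypothetical, not the paper's algorithm.
import random

class ToyPOMDP:
    """Two hidden states aliased to a single observation; two actions."""
    def reset(self):
        self.state = 0 if random.random() < 0.7 else 1   # state 0 is more likely
        return 0                                          # same observation for both states

    def step(self, action):
        reward = 1.0 if action == self.state else 0.0     # matching action pays off
        done = random.random() < 0.1                      # geometric episode length
        self.state = 0 if random.random() < 0.7 else 1
        return 0, reward, done

def sample_action(policy, obs):
    """policy[obs] is P(action = 0 | obs)."""
    return 0 if random.random() < policy[obs] else 1

def run_episode(env, policy, max_steps=200):
    """Return the list of (obs, action, reward) triples for one episode."""
    obs, traj = env.reset(), []
    for _ in range(max_steps):
        a = sample_action(policy, obs)
        next_obs, r, done = env.step(a)
        traj.append((obs, a, r))
        obs = next_obs
        if done:
            break
    return traj

def mc_action_values(env, policy, episodes=2000, gamma=0.95):
    """First-visit Monte-Carlo estimates of Q(obs, action) under `policy`."""
    totals, counts = {}, {}
    for _ in range(episodes):
        traj, seen = run_episode(env, policy), set()
        for t, (obs, a, _) in enumerate(traj):
            if (obs, a) in seen:
                continue
            seen.add((obs, a))
            g = sum(gamma ** k * r for k, (_, _, r) in enumerate(traj[t:]))
            totals[(obs, a)] = totals.get((obs, a), 0.0) + g
            counts[(obs, a)] = counts.get((obs, a), 0) + 1
    return {k: totals[k] / counts[k] for k in totals}

def estimate_return(env, policy, episodes=1000, gamma=0.95):
    return sum(sum(gamma ** k * r for k, (_, _, r) in enumerate(run_episode(env, policy)))
               for _ in range(episodes)) / episodes

def hill_climb(env, iters=10, step=0.2):
    policy = {0: 0.5}                                     # start from the uniform policy
    best = estimate_return(env, policy)
    for _ in range(iters):
        q = mc_action_values(env, policy)
        # Propose shifting probability toward the action with the higher estimated value.
        direction = step if q.get((0, 0), 0.0) >= q.get((0, 1), 0.0) else -step
        candidate = {0: min(1.0, max(0.0, policy[0] + direction))}
        value = estimate_return(env, candidate)
        if value > best:                                  # keep the step only if it looks better
            policy, best = candidate, value
    return policy, best

if __name__ == "__main__":
    random.seed(0)
    policy, value = hill_climb(ToyPOMDP())
    print(f"P(action 0 | obs) = {policy[0]:.2f}, estimated return = {value:.2f}")
```

In this toy setting the two hidden states share one observation, so the best memoryless policy simply favors the action that pays off in the more likely state; the paper's contribution is a principled action-value definition and update scheme for which such hill-climbing comes with convergence and PAC-style guarantees.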