Rewarding behaviors

  • Authors:
  • Fahiem Bacchus; Craig Boutilier; Adam Grove

  • Affiliations:
  • Dept. Computer Science, University of Waterloo, Waterloo, Ontario, Canada; Dept. Computer Science, University of British Columbia, Vancouver, B.C., Canada; NEC Research Institute, Princeton, NJ

  • Venue:
  • AAAI'96 Proceedings of the thirteenth national conference on Artificial intelligence - Volume 2
  • Year:
  • 1996

Abstract

Markov decision processes (MDPs) are a very popular tool for decision theoretic planning (DTP), partly because of the well-developed, expressive theory that includes effective solution techniques. But the Markov assumption--that dynamics and rewards depend on the current state only, and not on history--is often inappropriate. This is especially true of rewards: we frequently wish to associate rewards with behaviors that extend over time. Of course, such reward processes can be encoded in an MDP should we have a rich enough state space (where states encode enough history). However, it is often difficult to "hand craft" suitable state spaces that encode an appropriate amount of history. We consider this problem in the case where non-Markovian rewards are encoded by assigning values to formulas of a temporal logic. These formulas characterize the value of temporally extended behaviors. We argue that this allows a natural representation of many commonly encountered non-Markovian rewards. The main result is an algorithm that, given a decision process with non-Markovian rewards expressed in this manner, automatically constructs an equivalent MDP (with Markovian reward structure), allowing optimal policy construction using standard techniques.
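
The core idea of the abstract can be illustrated with a minimal, hypothetical sketch (not the paper's construction or its temporal logic): a reward that depends on history, such as "receive value once goal state g has ever been visited," becomes Markovian if each state is augmented with a single bit recording whether that condition has been satisfied. All names, the toy dynamics, and the reward values below are assumptions made for illustration only.

```python
# Illustrative sketch only: making a simple "past"-dependent reward
# (value 10 per step once state 'g' has ever been visited) Markovian
# by augmenting base states with one bit of history.
# The states, actions, dynamics, and reward are invented for this toy
# example; this is not the algorithm described in the paper.

from typing import Dict, Tuple

State = str
AugState = Tuple[State, bool]  # (base state, "g visited so far?")

ACTIONS = ["left", "right"]

# Toy deterministic dynamics over three base states.
BASE_TRANSITIONS: Dict[Tuple[State, str], State] = {
    ("s0", "right"): "s1", ("s0", "left"): "s0",
    ("s1", "right"): "g",  ("s1", "left"): "s0",
    ("g",  "right"): "g",  ("g",  "left"): "s1",
}

def augment(state: State, seen_g: bool) -> AugState:
    """Augmented state: the base state plus one bit of history."""
    return (state, seen_g or state == "g")

def step(aug: AugState, action: str) -> Tuple[AugState, float]:
    """Markovian transition and reward over the augmented state space."""
    base, seen_g = aug
    next_base = BASE_TRANSITIONS[(base, action)]
    next_aug = augment(next_base, seen_g)
    # On base states this reward is non-Markovian (it depends on whether
    # g was visited at any time in the past); on augmented states it
    # depends only on the current augmented state.
    reward = 10.0 if next_aug[1] else 0.0
    return next_aug, reward

if __name__ == "__main__":
    aug = augment("s0", False)
    for a in ["right", "right", "left"]:
        aug, r = step(aug, a)
        print(a, "->", aug, "reward", r)
```

Note that after the agent leaves g, the reward persists because the history bit remains set; this is exactly the kind of temporally extended reward that cannot be expressed on the base states alone, and the paper's contribution is to construct such an augmented MDP automatically from temporal logic reward formulas rather than by hand.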