Dynamic preferences in multi-criteria reinforcement learning

  • Authors:
  • Sriraam Natarajan;Prasad Tadepalli

  • Affiliations:
  • Oregon State University;Oregon State University

  • Venue:
  • ICML '05 Proceedings of the 22nd international conference on Machine learning
  • Year:
  • 2005

Quantified Score

Hi-index 0.00

Visualization

Abstract

The current framework of reinforcement learning is based on maximizing the expected returns based on scalar rewards. But in many real world situations, tradeoffs must be made among multiple objectives. Moreover, the agent's preferences between different objectives may vary with time. In this paper, we consider the problem of learning in the presence of time-varying preferences among multiple objectives, using numeric weights to represent their importance. We propose a method that allows us to store a finite number of policies, choose an appropriate policy for any weight vector and improve upon it. The idea is that although there are infinitely many weight vectors, they may be well-covered by a small number of optimal policies. We show this empirically in two domains: a version of the Buridan's ass problem and network routing.