Coordinated learning in multiagent MDPs with infinite state-space

  • Authors: Francisco S. Melo; M. Isabel Ribeiro

  • Affiliations: School of Computer Science, Carnegie Mellon University, Pittsburgh, USA 15213; Institute for Systems and Robotics, Instituto Superior Técnico, Lisbon, Portugal 1049-001

  • Venue: Autonomous Agents and Multi-Agent Systems
  • Year: 2010

Abstract

In this paper we address the problem of simultaneous learning and coordination in multiagent Markov decision problems (MMDPs) with infinite state-spaces. We separate this problem into two distinct subproblems: learning and coordination. To tackle the problem of learning, we survey Q-learning with soft-state aggregation (Q-SSA), a well-known method from the reinforcement learning literature (Singh et al. in Advances in neural information processing systems. MIT Press, Cambridge, vol 7, pp 361-368, 1994). Q-SSA allows the agents in the game to approximate the optimal Q-function, from which the optimal policies can be computed. We establish the convergence of Q-SSA and introduce a new result describing the rate of convergence of this method. In tackling the problem of coordination, we start by pointing out that knowledge of the optimal Q-function is not enough to ensure that all agents adopt a jointly optimal policy. We propose a novel coordination mechanism that, given knowledge of the optimal Q-function for an MMDP, ensures that all agents converge to a jointly optimal policy in every relevant state of the game. This coordination mechanism, approximate biased adaptive play (ABAP), extends biased adaptive play (Wang and Sandholm in Advances in neural information processing systems. MIT Press, Cambridge, vol 15, pp 1571-1578, 2003) to MMDPs with infinite state-spaces. Finally, we combine Q-SSA with ABAP, leading to a novel algorithm in which learning of the game and coordination take place simultaneously. We discuss several important properties of this new algorithm and establish its convergence with probability 1. We also provide simple illustrative examples of its application.
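
To make the learning subproblem more concrete, the following is a minimal sketch of a single Q-learning update with soft-state aggregation on a one-dimensional continuous state. It is an illustration under stated assumptions, not the algorithm as specified in the paper: the Gaussian membership function, the cluster centers, the step size, the discount factor, and all function names (`memberships`, `state_q`, `q_ssa_update`) are hypothetical choices made here. The update rule shown is one common form of soft-state aggregation, in which Q-values are stored per cluster and the value of a state is the membership-weighted average over clusters.

```python
import math
import random


def memberships(state, centers, width=0.2):
    """Soft membership P(x | s) of a state in each cluster.

    Assumption: normalized Gaussian kernels around fixed centers, with the
    state lying close enough to some center that the weights do not underflow.
    """
    weights = [math.exp(-((state - c) / width) ** 2) for c in centers]
    total = sum(weights)
    return [w / total for w in weights]


def state_q(q, probs, action):
    """Q-value of a state: membership-weighted average of cluster Q-values."""
    return sum(p * q_x[action] for p, q_x in zip(probs, q))


def q_ssa_update(q, state, action, reward, next_state, centers,
                 alpha=0.1, gamma=0.9, rng=random):
    """Apply one soft-state-aggregation Q-learning update in place.

    q is a list of per-cluster lists, q[cluster][action].
    Returns the index of the cluster that was sampled and updated.
    """
    # Sample the cluster that "owns" the current state.
    probs = memberships(state, centers)
    cluster = rng.choices(range(len(centers)), weights=probs)[0]

    # Bootstrap from the aggregated value of the next state.
    next_probs = memberships(next_state, centers)
    n_actions = len(q[0])
    best_next = max(state_q(q, next_probs, a) for a in range(n_actions))
    target = reward + gamma * best_next

    # Standard temporal-difference step on the sampled cluster only.
    q[cluster][action] += alpha * (target - q[cluster][action])
    return cluster
```

In a multiagent setting, `action` would index a joint action. As the abstract notes, a table learned this way does not by itself tell the agents which of several jointly optimal actions to pick, which is the gap the coordination mechanism is meant to close.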