Policy gradient methods in multi-agent systems: pursuit problem

  • Authors:
  • Seiji Ishihara; Harukazu Igarashi

  • Affiliations:
  • School of Engineering, Kinki University, 1 Takayaumenobe, Higashi-Hiroshima-shi, 739-2116 Japan (both authors)

  • Venue:
  • Design and application of hybrid intelligent systems
  • Year:
  • 2003

Abstract

Policy gradient methods are useful approaches to reinforcement learning in multi-agent systems. With these methods, the decision problem of a multi-agent system can be divided into a set of independent decision problems, one per agent, by adopting autonomous decentralized control. These methods use parameterized stochastic policies whose parameters are updated stochastically to maximize the expected reward. In this paper, we first formulate each agent's decision problem as the minimization of an objective function. We adopt a Boltzmann distribution as the stochastic policy, with the objective function serving as the energy of that distribution. Next, we show that the objective function can be defined by a state-value function, the sum of the weight parameters of state-action rules, and heuristic potentials. Moreover, we apply this method to pursuit problems. Experimental results show that, with these objective functions, the method produces episodes as short as those of Q-learning, readily handles constraints such as time-window restrictions on episode length, and can exploit heuristic knowledge such as an attractive potential toward the target.
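
To make the Boltzmann-policy formulation concrete, the following is a minimal sketch, not the authors' implementation: a tabular energy E(s, a) given directly by the weight parameters theta, a Boltzmann policy over that energy, and a REINFORCE-style stochastic update of theta toward higher expected reward. The toy environment, its reward, and all sizes and constants (N_STATES, N_ACTIONS, T, ALPHA) are illustrative assumptions; the paper's full objective function also includes a state-value term and heuristic potentials, which are omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)

N_STATES, N_ACTIONS = 5, 4   # toy sizes, chosen for illustration only
T = 1.0                      # temperature of the Boltzmann distribution
ALPHA = 0.1                  # learning rate

# theta[s, a]: weight parameter of the state-action rule (s, a); in this
# sketch the energy is simply E(s, a) = theta[s, a].
theta = np.zeros((N_STATES, N_ACTIONS))

def policy(state):
    """Boltzmann policy: pi(a|s) = exp(-E(s,a)/T) / sum_b exp(-E(s,b)/T)."""
    logits = -theta[state] / T
    logits -= logits.max()           # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def run_episode(length=10):
    """Sample one episode; reward 1 whenever action 0 is taken, a toy
    stand-in for 'the pursuer moved toward the target'."""
    trajectory, total_reward = [], 0.0
    for _ in range(length):
        state = int(rng.integers(N_STATES))
        action = int(rng.choice(N_ACTIONS, p=policy(state)))
        trajectory.append((state, action))
        if action == 0:
            total_reward += 1.0
    return trajectory, total_reward

def update(trajectory, total_reward):
    """REINFORCE update: theta += alpha * R * grad(log pi). For a Boltzmann
    policy, the gradient of log pi(a|s) w.r.t. theta[s, :] is
    (pi(.|s) - onehot(a)) / T, so rewarded actions get lower energy
    and hence higher probability."""
    for state, action in trajectory:
        probs = policy(state)
        one_hot = np.eye(N_ACTIONS)[action]
        theta[state] += ALPHA * total_reward * (probs - one_hot) / T

for _ in range(300):
    update(*run_episode())

print("pi(.|s=0) after training:", np.round(policy(0), 3))
```

After training, the probability of the rewarded action rises toward 1 in every state, since lowering the energy of well-rewarded state-action rules is exactly how the Boltzmann policy concentrates probability mass.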