Behavior Learning Based on a Policy Gradient Method: Separation of Environmental Dynamics and State Values in Policies

  • Authors:
  • Seiji Ishihara; Harukazu Igarashi

  • Affiliations:
  • Kinki University, Hiroshima, Japan 739-2116; Shibaura Institute of Technology, Tokyo, Japan 135-8548

  • Venue:
  • PRICAI '08: Proceedings of the 10th Pacific Rim International Conference on Artificial Intelligence: Trends in Artificial Intelligence
  • Year:
  • 2008

Abstract

Policy gradient methods are a useful approach in reinforcement learning. In our policy gradient approach to agent behavior learning, we formulate the agent's decision problem at each time step as minimization of an objective function. In this paper, we propose an objective function containing two types of parameters: one representing environmental dynamics and the other representing state-value functions. We derive separate learning rules for the two types of parameters so that the two sets can be learned independently. This separation makes it possible to reuse learned state-value functions for agents under different environmental dynamics, even when the dynamics are stochastic. Simulation experiments on learning hunter-agent policies in pursuit problems demonstrate the effectiveness of our method.
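
The abstract describes the separation only at a high level. The following minimal sketch, which is an illustration rather than the authors' formulation, shows one way such a separation could look: a learned softmax model of the dynamics P(s'|s,a) parameterized by theta, tabular state values v, action preferences f(s,a) = Σ_s' P_theta(s'|s,a) v(s'), a Boltzmann policy over f, and REINFORCE-style gradient updates accumulated separately for theta and v. The toy MDP, the softmax dynamics model, and all names here are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A = 5, 2          # number of states and actions in the toy MDP
T, alpha = 1.0, 0.1  # Boltzmann temperature and learning rate

# Hypothetical environment (not from the paper): random stochastic
# dynamics with reward 1 for reaching state S-1.
P_true = rng.dirichlet(np.ones(S), size=(S, A))

theta = np.zeros((S, A, S))  # dynamics parameters (softmax logits)
v = np.zeros(S)              # state-value parameters

def dyn(s, a):
    """Learned model of P(s' | s, a) as a softmax over theta[s, a]."""
    z = np.exp(theta[s, a] - theta[s, a].max())
    return z / z.sum()

def prefs(s):
    """Action preferences f(s, a) = sum_{s'} P_theta(s'|s,a) v(s')."""
    return np.array([dyn(s, a) @ v for a in range(A)])

def policy(s):
    """Boltzmann policy pi(a|s) proportional to exp(f(s, a) / T)."""
    f = prefs(s)
    z = np.exp((f - f.max()) / T)
    return z / z.sum()

for episode in range(2000):
    s, traj, ret = 0, [], 0.0
    for _ in range(10):                    # fixed-length episodes
        a = rng.choice(A, p=policy(s))
        s2 = rng.choice(S, p=P_true[s, a])
        ret += 1.0 if s2 == S - 1 else 0.0
        traj.append((s, a))
        s = s2

    # REINFORCE-style updates, accumulated separately for the two
    # parameter sets so each can be learned (or frozen) independently.
    dv, dth = np.zeros_like(v), np.zeros_like(theta)
    for s, a in traj:
        pi, f = policy(s), prefs(s)
        # d log pi / d v: predicted next-state distribution for the
        # taken action minus its policy-weighted average over actions.
        dv += (dyn(s, a) - sum(pi[b] * dyn(s, b) for b in range(A))) / T
        # d log pi / d theta[s, b] via the softmax Jacobian of dyn.
        for b in range(A):
            p = dyn(s, b)
            dth[s, b] += ((b == a) - pi[b]) * p * (v - f[b]) / T
    v += alpha * ret * dv
    theta += alpha * ret * dth
```

Because the two gradients are computed and applied independently, the value parameters v could in principle be kept and reused when the dynamics model theta is relearned for a different environment, which is the kind of reuse the abstract points to.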