Efficient approximate policy iteration methods for sequential decision making in reinforcement learning

  • Authors:
  • Michail G. Lagoudakis; Ronald Parr

  • Venue:
  • Doctoral Dissertation, Duke University
  • Year:
  • 2003

Abstract

Reinforcement learning is a promising learning paradigm in which an agent learns how to make good decisions by interacting with an (unknown) environment. This learning framework can be extended along two dimensions: the number of decision makers (single- or multi-agent) and the nature of the interaction (collaborative or competitive). This characterization leads to the four decision-making situations considered in this thesis, which are modeled as Markov decision processes, team Markov decision processes, zero-sum Markov games, and team zero-sum Markov games.

Existing reinforcement learning algorithms have not been applied widely to real-world problems, mainly because the required resources grow quickly with the size of the problem. Exact, but impractical, solutions are commonly abandoned in favor of approximate, but practical, ones. Unfortunately, research on efficient and stable approximate methods has focused mainly on the prediction problem, where an agent tries to learn the outcome of a fixed decision policy. This thesis contributes two efficient and stable algorithms, based on the general framework of approximate policy iteration, for the control problem, in which the agent tries to learn a good decision policy.

Least-Squares Policy Iteration (LSPI) learns good policies from a least-squares fixed-point approximation of the value function. LSPI makes efficient use of sample experience and is therefore most appropriate for domains where training data are expensive or no simulator of the process is available. Rollout Classification Policy Iteration (RCPI), on the other hand, learns good policies by using rollouts (Monte-Carlo simulation estimates) to train a classifier that represents an approximate policy; for that reason, RCPI is most appropriate for domains where experience comes at no cost or where a simulator is available. Both algorithms exhibit desirable theoretical properties and bear strong connections to other research areas, namely feature selection and classification learning, respectively.

The proposed algorithms are demonstrated on a variety of learning tasks: chain walk, inverted pendulum balancing, bicycle balancing and riding, the game of Tetris, multiagent system administration, distributed power grid control, server-router flow control, a two-player soccer game, and multiagent server-router flow control. These results clearly demonstrate the efficiency and applicability of the new algorithms to large reinforcement learning control problems.
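To make the two contributions concrete, below is a minimal Python sketch of the LSTD-Q solve at the heart of LSPI. It assumes a fixed feature map phi(s, a), a finite action set, and a batch of (s, a, r, s') samples; all names, shapes, and constants are illustrative assumptions, not the thesis' own implementation.

```python
# Minimal sketch of the LSTD-Q step inside LSPI (assumed interfaces, not the
# thesis' own code): phi(s, a) returns a length-k feature vector, `samples`
# is a batch of (s, a, r, s_next) transitions, and `actions` is a finite set.
import numpy as np

def lstdq(samples, phi, policy, k, gamma=0.95):
    """Least-squares fixed-point approximation of Q^policy as phi(s, a) . w."""
    A = np.zeros((k, k))
    b = np.zeros(k)
    for s, a, r, s_next in samples:
        f = phi(s, a)
        f_next = phi(s_next, policy(s_next))       # next action chosen by the current policy
        A += np.outer(f, f - gamma * f_next)       # A = sum phi (phi - gamma phi')^T
        b += r * f                                 # b = sum r phi
    # Small ridge term guards against a singular A on sparse sample sets.
    return np.linalg.solve(A + 1e-6 * np.eye(k), b)

def lspi(samples, phi, k, actions, gamma=0.95, max_iter=20, tol=1e-6):
    """Approximate policy iteration: re-solve LSTD-Q under the greedy policy."""
    w = np.zeros(k)
    greedy = lambda s: max(actions, key=lambda a: float(phi(s, a) @ w))
    for _ in range(max_iter):
        w_new = lstdq(samples, phi, greedy, k, gamma)
        converged = np.linalg.norm(w_new - w) < tol
        w = w_new                                  # greedy sees the updated weights via closure
        if converged:
            break
    return w, greedy
```

Similarly, a sketch of one RCPI iteration, assuming a generative model simulate(state, action) -> (next_state, reward, done) and a generic fit/predict classifier; these interfaces, and the rollout horizon and counts, are assumptions made for illustration rather than the thesis' exact procedure.

```python
# Minimal sketch of one RCPI iteration (assumed interfaces): simulate acts as
# a simulator/generative model, `classifier` is any fit/predict classifier,
# and the horizon and rollout counts are illustrative.
import numpy as np

def rollout_q(simulate, s, a, policy, gamma=0.95, horizon=50, n_rollouts=10):
    """Monte-Carlo estimate of Q^policy(s, a) from truncated rollouts."""
    total = 0.0
    for _ in range(n_rollouts):
        state, action, discount, ret = s, a, 1.0, 0.0
        for _ in range(horizon):
            state, reward, done = simulate(state, action)
            ret += discount * reward
            discount *= gamma
            if done:
                break
            action = policy(state)                 # follow the current policy after the first step
        total += ret
    return total / n_rollouts

def rcpi_iteration(states, actions, simulate, policy, classifier):
    """Label sampled states with the rollout-greedy action, then fit a classifier."""
    X, y = [], []
    for s in states:
        q = {a: rollout_q(simulate, s, a, policy) for a in actions}
        X.append(s)
        y.append(max(q, key=q.get))                # empirically best action becomes the label
    classifier.fit(np.array(X), np.array(y))       # the classifier is the new approximate policy
    return lambda s: classifier.predict(np.array([s]))[0]
```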