Decision theoretic generalizations of the PAC model for neural net and other learning applications
Information and Computation
Bounding the Vapnik-Chervonenkis Dimension of Concept Classes Parameterized by Real Numbers
Machine Learning - Special issue on COLT '93
Gradient descent for general reinforcement learning
Proceedings of the 1998 Conference on Advances in Neural Information Processing Systems 11
Dynamic Programming and Optimal Control
Introduction to Reinforcement Learning
Learning to Drive a Bicycle Using Reinforcement Learning and Shaping
ICML '98 Proceedings of the Fifteenth International Conference on Machine Learning
Learning and value function approximation in complex decision processes
Estimation of Dependences Based on Empirical Data (Springer Series in Statistics)
Learning finite-state controllers for partially observable environments
UAI'99 Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence
Exploiting probabilistic knowledge under uncertain sensing for efficient robot behaviour
IJCAI'11 Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Three
Lagrangian relaxation for large-scale multi-agent planning
Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems - Volume 3
Why long words take longer to read: the role of uncertainty about word length
CMCL '12 Proceedings of the 3rd Workshop on Cognitive Modeling and Computational Linguistics
Adaptive reservoir computing through evolution and learning
Neurocomputing
Lagrangian Relaxation for Large-Scale Multi-agent Planning
WI-IAT '12 Proceedings of the 2012 IEEE/WIC/ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technology - Volume 02
A partially observable hybrid system model for bipedal locomotion for adapting to terrain variations
Proceedings of the 16th International Conference on Hybrid Systems: Computation and Control
Efficient sample reuse in policy gradients with parameter-based exploration
Neural Computation
Adaptive collective routing using Gaussian process dynamic congestion models
Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Probabilistic model-based imitation learning
Adaptive Behavior - Animals, Animats, Software Agents, Robots, Adaptive Systems
Scheduling sensors for monitoring sentient spaces using an approximate POMDP policy
Pervasive and Mobile Computing
Automatica (Journal of IFAC)
We propose a new approach to the problem of searching a space of policies for a Markov decision process (MDP) or a partially observable Markov decision process (POMDP), given a model. Our approach is based on the following observation: Any (PO)MDP can be transformed into an "equivalent" POMDP in which all state transitions (given the current state and action) are deterministic. This reduces the general problem of policy search to one in which we need only consider POMDPs with deterministic transitions. We give a natural way of estimating the value of all policies in these transformed POMDPs. Policy search is then simply performed by searching for a policy with high estimated value. We also establish conditions under which our value estimates will be good, recovering theoretical results similar to those of Kearns, Mansour and Ng [7], but with "sample complexity" bounds that have only a polynomial rather than exponential dependence on the horizon time. Our method applies to arbitrary POMDPs, including ones with infinite state and action spaces. We also present empirical results for our approach on a small discrete problem, and on a complex continuous state/continuous action problem involving learning to ride a bicycle.
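The core idea of the abstract — fixing all transition randomness in advance so that every rollout becomes a deterministic function of the policy — can be illustrated with a short sketch. The following is a minimal, hypothetical Python example on a toy chain problem (the problem, names, and constants are illustrative, not from the paper): m fixed "scenarios" of pre-sampled uniform random numbers play the role of the deterministic transformed POMDP, and policy search reduces to deterministic optimization of the estimated value.

```python
import random

# Minimal sketch of scenario-based value estimation (illustrative toy
# problem; all names and constants below are assumptions, not the
# paper's). The uniform draw u supplies ALL transition randomness, so
# with the scenarios fixed, each rollout is deterministic in the policy.

GAMMA = 0.9      # discount factor
HORIZON = 20     # truncated horizon
M = 100          # number of fixed scenarios (pre-sampled randomness)

def step(state, action, u):
    """Toy stochastic transition driven entirely by the uniform draw u."""
    # With probability 0.8 the action moves the state as intended,
    # otherwise it is reversed.
    state = state + action if u < 0.8 else state - action
    reward = 1.0 if state == 0 else 0.0   # reward for reaching the origin
    return state, reward

def estimate_value(policy, scenarios):
    """Average discounted return over the fixed scenarios (deterministic)."""
    total = 0.0
    for us in scenarios:
        state, ret, discount = 1, 0.0, 1.0
        for t in range(HORIZON):
            state, r = step(state, policy(state), us[t])
            ret += discount * r
            discount *= GAMMA
        total += ret
    return total / len(scenarios)

# Pre-sample the scenarios once; every later evaluation reuses them.
rng = random.Random(0)
scenarios = [[rng.random() for _ in range(HORIZON)] for _ in range(M)]

# Policy search is now ordinary deterministic optimization over policies:
toward_origin = lambda s: -1 if s > 0 else (1 if s < 0 else 0)
away_from_origin = lambda s: 1
print(estimate_value(toward_origin, scenarios)
      > estimate_value(away_from_origin, scenarios))
```

Because the same scenarios are reused for every policy, repeated evaluations of one policy return identical values, and comparisons between policies are not corrupted by fresh sampling noise — which is what makes searching for a policy with high estimated value well posed.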