Infinite-Horizon Policy-Gradient Estimation

  • Authors:
  • Jonathan Baxter; Peter L. Bartlett

  • Affiliations:
  • WhizBang! Labs, Pittsburgh, PA; BIOwulf Technologies, Berkeley, CA

  • Venue:
  • Journal of Artificial Intelligence Research
  • Year:
  • 2001

Abstract

Gradient-based approaches to direct policy search in reinforcement learning have received much recent attention as a means to solve problems of partial observability and to avoid some of the problems associated with policy degradation in value-function methods. In this paper we introduce GPOMDP, a simulation-based algorithm for generating a biased estimate of the gradient of the average reward in Partially Observable Markov Decision Processes (POMDPs) controlled by parameterized stochastic policies. A similar algorithm was proposed by Kimura, Yamamura, and Kobayashi (1995). The algorithm's chief advantages are that it requires storage of only twice the number of policy parameters, uses one free parameter β ∈ [0, 1) (which has a natural interpretation in terms of bias-variance trade-off), and requires no knowledge of the underlying state. We prove convergence of GPOMDP, and show how the correct choice of the parameter β is related to the mixing time of the controlled POMDP. We briefly describe extensions of GPOMDP to controlled Markov chains, continuous state, observation and control spaces, multiple agents, higher-order derivatives, and a version for training stochastic policies with internal states. In a companion paper (Baxter, Bartlett, & Weaver, 2001) we show how the gradient estimates generated by GPOMDP can be used in both a traditional stochastic gradient algorithm and a conjugate-gradient procedure to find local optima of the average reward.
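
To make the estimator and its storage claim concrete, below is a minimal Python sketch of a GPOMDP-style update following the description in the abstract. The `env`, `sample_action`, and `grad_log_policy` interfaces and the function name `gpomdp_estimate` are illustrative assumptions, not code from the paper.

```python
import numpy as np

def gpomdp_estimate(env, sample_action, grad_log_policy, theta, beta, T,
                    rng=None):
    """Sketch of a GPOMDP-style gradient estimate (interfaces assumed).

    env             -- assumed to expose reset() -> observation and
                       step(action) -> (observation, reward)
    sample_action   -- draws an action u_t from the stochastic policy
                       mu(. | theta, y_t) given observation y_t
    grad_log_policy -- returns grad_theta log mu(u_t | theta, y_t)
    beta            -- free parameter in [0, 1); larger values reduce
                       bias but increase variance of the estimate
    T               -- number of simulation steps

    Only two parameter-sized vectors (z and delta) are kept, matching
    the paper's claim of storage equal to twice the number of policy
    parameters; no knowledge of the underlying state is used.
    """
    rng = np.random.default_rng() if rng is None else rng
    z = np.zeros_like(theta)      # eligibility trace
    delta = np.zeros_like(theta)  # running average of reward-weighted traces
    obs = env.reset()
    for t in range(T):
        action = sample_action(theta, obs, rng)
        next_obs, reward = env.step(action)
        # Discounted eligibility trace: z <- beta*z + grad log mu(u | theta, y)
        z = beta * z + grad_log_policy(theta, obs, action)
        # Incremental average: delta <- delta + (r*z - delta) / (t + 1)
        delta += (reward * z - delta) / (t + 1)
        obs = next_obs
    return delta  # biased estimate of the average-reward gradient
```

As the abstract notes, β governs a bias-variance trade-off: as β approaches 1 the bias of the estimate shrinks at a rate tied to the mixing time of the controlled POMDP, while its variance grows, so the right choice depends on how quickly the process mixes.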