Gradient-based approaches to direct policy search in reinforcement learning have received much recent attention as a means to solve problems of partial observability and to avoid some of the problems associated with policy degradation in value-function methods. In this paper we introduce GPOMDP, a simulation-based algorithm for generating a biased estimate of the gradient of the average reward in Partially Observable Markov Decision Processes (POMDPs) controlled by parameterized stochastic policies. A similar algorithm was proposed by Kimura, Yamamura, and Kobayashi (1995). The algorithm's chief advantages are that it requires storage of only twice the number of policy parameters, uses one free parameter β ∈ [0, 1) (which has a natural interpretation in terms of a bias-variance trade-off), and requires no knowledge of the underlying state. We prove convergence of GPOMDP, and show how the correct choice of the parameter β is related to the mixing time of the controlled POMDP. We briefly describe extensions of GPOMDP to controlled Markov chains; continuous state, observation, and control spaces; multiple agents; higher-order derivatives; and a version for training stochastic policies with internal states. In a companion paper (Baxter, Bartlett, & Weaver, 2001) we show how the gradient estimates generated by GPOMDP can be used in both a traditional stochastic gradient algorithm and a conjugate-gradient procedure to find local optima of the average reward.
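The estimator the abstract describes can be sketched concretely: GPOMDP keeps only two parameter-sized vectors, an eligibility trace z (discounted by β) and a running average Δ of reward-weighted traces. The sketch below runs it on a hypothetical two-action toy problem with a softmax policy (action 0 pays reward 1, action 1 pays 0); the environment and policy here are illustrative stand-ins, not the paper's experimental setup.

```python
import numpy as np

def softmax(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

def gpomdp(theta, T=5000, beta=0.9, seed=0):
    """Biased average-reward gradient estimate via GPOMDP on a toy
    2-action problem (hypothetical: action 0 yields reward 1, action 1
    yields 0). Storage is just two vectors the size of theta:
    the eligibility trace z and the running gradient estimate delta."""
    rng = np.random.default_rng(seed)
    z = np.zeros_like(theta)      # eligibility trace, discounted by beta
    delta = np.zeros_like(theta)  # running gradient estimate
    for t in range(T):
        pi = softmax(theta)
        u = rng.choice(2, p=pi)
        # score function for a softmax policy: grad log pi(u) = e_u - pi
        g = -pi.copy()
        g[u] += 1.0
        z = beta * z + g
        r = 1.0 if u == 0 else 0.0
        # incremental average of r * z over the single sample path
        delta += (r * z - delta) / (t + 1)
    return delta

grad = gpomdp(np.zeros(2))
```

Larger β reduces the bias of the estimate but inflates its variance, which is the trade-off the abstract ties to the mixing time of the controlled process; here the gradient estimate should point toward raising the probability of the rewarding action.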