In this paper, we present algorithms that perform gradient ascent of the average reward in a partially observable Markov decision process (POMDP). These algorithms are based on GPOMDP, an algorithm introduced in a companion paper (Baxter & Bartlett, 2001) that computes biased estimates of the performance gradient in POMDPs. The chief advantages of GPOMDP are that it uses only one free parameter β ∈ [0, 1], which has a natural interpretation in terms of a bias-variance trade-off, that it requires no knowledge of the underlying state, and that it can be applied to infinite state, control, and observation spaces. We show how the gradient estimates produced by GPOMDP can be used to perform gradient ascent, both with a traditional stochastic-gradient algorithm and with an algorithm based on conjugate gradients that uses gradient information to bracket maxima in line searches. Experimental results are presented illustrating both the theoretical results of Baxter and Bartlett (2001) on a toy problem and practical aspects of the algorithms on a number of more realistic problems.
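As a rough sketch of the idea described above, GPOMDP maintains an eligibility trace z that accumulates the score function ∇<sub>θ</sub> log π(a|θ), discounted by β, and averages the product of rewards with that trace to form a (generally biased) gradient estimate, which can then drive stochastic gradient ascent. The toy problem below — a stateless two-armed bandit with a softmax policy — is a hypothetical illustration chosen for brevity, not one of the paper's experiments:

```python
import numpy as np

def gpomdp_estimate(theta, beta=0.9, T=2000, rng=None):
    """GPOMDP-style estimate of the average-reward gradient.

    Toy problem (an assumption for illustration, not from the paper):
    a stateless two-armed bandit where arm 0 pays 1.0 and arm 1 pays
    0.0, controlled by a softmax policy with parameters theta.
    """
    rng = np.random.default_rng(rng)
    rewards = np.array([1.0, 0.0])
    z = np.zeros_like(theta)       # eligibility trace z_t
    delta = np.zeros_like(theta)   # running average of r_{t+1} * z_{t+1}
    for t in range(T):
        p = np.exp(theta - theta.max())
        p /= p.sum()
        a = rng.choice(2, p=p)
        grad_log = -p                       # grad_theta log pi(a | theta)
        grad_log[a] += 1.0                  # for a softmax policy
        z = beta * z + grad_log             # old credit decays by beta
        delta += (rewards[a] * z - delta) / (t + 1)
    return delta

# Traditional stochastic gradient ascent using the GPOMDP estimates;
# beta trades bias (low beta) against variance (beta near 1).
rng = np.random.default_rng(0)
theta = np.zeros(2)
for _ in range(50):
    theta += 0.5 * gpomdp_estimate(theta, beta=0.9, T=2000, rng=rng)
```

After training, the softmax probability of the rewarding arm should dominate; the paper's conjugate-gradient variant would replace the fixed-step update with line searches bracketed by gradient sign changes.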