Proceedings of the seventh international conference (1990) on Machine learning
Automatic programming of behavior-based robots using reinforcement learning. Artificial Intelligence.
Reinforcement learning for robots using neural networks.
Simulation and the Monte Carlo Method.
Least Squares Policy Evaluation Algorithms with Linear Function Approximation. Discrete Event Dynamic Systems.
Learning to Predict by the Methods of Temporal Differences. Machine Learning.
Off-Policy Temporal Difference Learning with Function Approximation. ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning.
Learning from Scarce Experience. ICML '02 Proceedings of the Nineteenth International Conference on Machine Learning.
ICML '98 Proceedings of the Fifteenth International Conference on Machine Learning
Policy Improvement for POMDPs Using Normalized Importance Sampling. UAI '01 Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence.
Memory Approaches to Reinforcement Learning in Non-Markovian Domains.
SIAM Journal on Control and Optimization
Exploration and apprenticeship learning in reinforcement learning. ICML '05 Proceedings of the 22nd international conference on Machine learning.
Reinforcement Learning in Continuous Time and Space. Neural Computation.
Using inaccurate models in reinforcement learning. ICML '06 Proceedings of the 23rd international conference on Machine learning.
Neurocomputing
Reinforcement learning in the presence of rare events. Proceedings of the 25th international conference on Machine learning.
Natural actor-critic algorithms. Automatica (Journal of IFAC).
Efficient sample reuse in policy gradients with parameter-based exploration. Neural Computation.
Actor-Critics constitute an important class of reinforcement learning algorithms that can deal with continuous actions and states in an easy and natural way. This paper shows how these algorithms can be augmented with experience replay without degrading their convergence properties, by appropriately estimating the direction of policy change. This is achieved by applying truncated importance sampling to the recorded past experiences. It is formally shown that the resulting estimation bias is bounded and vanishes asymptotically, so the experience-replay-augmented algorithm preserves the convergence properties of the original one. Experience replay makes it possible to exploit the available computational power to reduce the required number of interactions with the environment considerably, which is essential for real-world applications. Experimental results demonstrate that the combination of experience replay and Actor-Critics yields extremely fast learning algorithms that achieve successful policies for non-trivial control tasks in remarkably short time. Namely, policies for the cart-pole swing-up [Doya, K. (2000). Reinforcement learning in continuous time and space. Neural Computation, 12(1), 219-245] are obtained after as little as 20 minutes of cart-pole time, and the policy for Half-Cheetah (a walking robot with 6 degrees of freedom) is obtained after four hours of Half-Cheetah time.
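To make the mechanism concrete, below is a minimal sketch (in Python with NumPy) of how truncated importance sampling can reweight replayed experience in an actor-critic policy update. It is a sketch under stated assumptions, not the paper's implementation: the Gaussian policy form, the truncation bound TRUNCATION, and names such as replay_actor_update are illustrative, and the critic's TD errors are stubbed with random numbers.

```python
# Sketch: truncated importance sampling over a replay buffer.
# Each replayed transition is reweighted by
#     rho = min(b, pi_current(a|s) / pi_behavior(a|s)),
# so the actor step remains a (boundedly biased) estimate of the
# on-policy policy-gradient direction. Illustrative, not the paper's code.
import numpy as np

rng = np.random.default_rng(0)

STATE_DIM, TRUNCATION = 3, 5.0   # TRUNCATION is the bound b on the weights


def log_prob(theta, state, action, sigma=0.5):
    """Log-density of a Gaussian policy a ~ N(theta . s, sigma^2)."""
    mean = theta @ state
    return -0.5 * ((action - mean) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))


def grad_log_prob(theta, state, action, sigma=0.5):
    """Gradient of the Gaussian log-density with respect to theta."""
    return (action - theta @ state) / sigma**2 * state


def replay_actor_update(theta_now, buffer, lr=1e-2):
    """One actor step from replayed experience with truncated IS weights."""
    grad = np.zeros_like(theta_now)
    for state, action, td_error, behavior_logp in buffer:
        # Weight of this old sample under the current policy, truncated
        # at TRUNCATION to keep the estimator's variance finite.
        rho = np.exp(log_prob(theta_now, state, action) - behavior_logp)
        rho = min(rho, TRUNCATION)
        # TD-error-weighted score function, as in a classic actor-critic.
        grad += rho * td_error * grad_log_prob(theta_now, state, action)
    return theta_now + lr * grad / len(buffer)


# Fill a toy buffer with transitions generated by an older behavior policy ...
theta_old, theta = rng.normal(size=STATE_DIM), np.zeros(STATE_DIM)
buffer = []
for _ in range(256):
    s = rng.normal(size=STATE_DIM)
    a = theta_old @ s + 0.5 * rng.normal()   # action drawn from behavior policy
    td = rng.normal()                        # stand-in for a critic's TD error
    buffer.append((s, a, td, log_prob(theta_old, s, a)))

# ... and replay it several times, amortizing one batch of interaction
# over many updates instead of collecting fresh experience each step.
for _ in range(10):
    theta = replay_actor_update(theta, buffer)
```

Truncating the weight at a fixed bound trades variance for bias: the untruncated ratio can blow up as the current policy drifts from the behavior policy, while the truncated estimator's bias is, per the paper's analysis, bounded and asymptotically vanishing.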