Proceedings of the seventh international conference (1990) on Machine learning
Automatic programming of behavior-based robots using reinforcement learning. Artificial Intelligence.
Reinforcement learning for robots using neural networks.
Simulation and the Monte Carlo Method.
Least Squares Policy Evaluation Algorithms with Linear Function Approximation. Discrete Event Dynamic Systems.
Learning to Predict by the Methods of Temporal Differences. Machine Learning.
Off-Policy Temporal Difference Learning with Function Approximation. ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning.
Learning from Scarce Experience. ICML '02 Proceedings of the Nineteenth International Conference on Machine Learning.
ICML '98 Proceedings of the Fifteenth International Conference on Machine Learning
Policy Improvement for POMDPs Using Normalized Importance Sampling. UAI '01 Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence.
Memory Approaches to Reinforcement Learning in Non-Markovian Domains.
SIAM Journal on Control and Optimization
Exploration and apprenticeship learning in reinforcement learning. ICML '05 Proceedings of the 22nd international conference on Machine learning.
Reinforcement Learning in Continuous Time and Space. Neural Computation.
Using inaccurate models in reinforcement learning. ICML '06 Proceedings of the 23rd international conference on Machine learning.
Neurocomputing
Reinforcement learning in the presence of rare events. Proceedings of the 25th international conference on Machine learning.
Natural actor-critic algorithms. Automatica (Journal of IFAC).
Efficient sample reuse in policy gradients with parameter-based exploration. Neural Computation.
Actor-Critics constitute an important class of reinforcement learning algorithms that can deal with continuous actions and states in an easy and natural way. This paper shows how these algorithms can be augmented with experience replay without degrading their convergence properties, by appropriately estimating the direction of policy change. This is achieved by applying truncated importance sampling to the recorded past experiences. It is formally shown that the resulting estimation bias is bounded and vanishes asymptotically, so the experience-replay-augmented algorithm preserves the convergence properties of the original one. Experience replay makes it possible to exploit the available computational power to reduce the required number of interactions with the environment considerably, which is essential for real-world applications. Experimental results demonstrate that the combination of experience replay and Actor-Critics yields extremely fast learning algorithms that achieve successful policies for non-trivial control tasks in remarkably short time. Namely, policies for the cart-pole swing-up [Doya, K. (2000). Reinforcement learning in continuous time and space. Neural Computation, 12(1), 219-245] are obtained after as little as 20 minutes of cart-pole time, and the policy for Half-Cheetah (a walking robot with 6 degrees of freedom) is obtained after four hours of Half-Cheetah time.
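To make the mechanism concrete, below is a minimal sketch (in Python with NumPy) of how truncated importance sampling can reweight replayed experience in an actor-critic policy update. It is a sketch under stated assumptions, not the paper's implementation: the Gaussian policy form, the truncation bound TRUNCATION, and names such as replay_actor_update are illustrative, and the critic's TD errors are stubbed with random numbers.

```python
# Sketch: truncated importance sampling over a replay buffer.
# Each replayed transition is reweighted by
#     rho = min(b, pi_current(a|s) / pi_behavior(a|s)),
# so the actor step remains a (boundedly biased) estimate of the
# on-policy policy-gradient direction. Illustrative, not the paper's code.
import numpy as np

rng = np.random.default_rng(0)

STATE_DIM, TRUNCATION = 3, 5.0   # TRUNCATION is the bound b on the weights


def log_prob(theta, state, action, sigma=0.5):
    """Log-density of a Gaussian policy a ~ N(theta . s, sigma^2)."""
    mean = theta @ state
    return -0.5 * ((action - mean) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))


def grad_log_prob(theta, state, action, sigma=0.5):
    """Gradient of the Gaussian log-density with respect to theta."""
    return (action - theta @ state) / sigma**2 * state


def replay_actor_update(theta_now, buffer, lr=1e-2):
    """One actor step from replayed experience with truncated IS weights."""
    grad = np.zeros_like(theta_now)
    for state, action, td_error, behavior_logp in buffer:
        # Weight of this old sample under the current policy, truncated
        # at TRUNCATION to keep the estimator's variance finite.
        rho = np.exp(log_prob(theta_now, state, action) - behavior_logp)
        rho = min(rho, TRUNCATION)
        # TD-error-weighted score function, as in a classic actor-critic.
        grad += rho * td_error * grad_log_prob(theta_now, state, action)
    return theta_now + lr * grad / len(buffer)


# Fill a toy buffer with transitions generated by an older behavior policy ...
theta_old, theta = rng.normal(size=STATE_DIM), np.zeros(STATE_DIM)
buffer = []
for _ in range(256):
    s = rng.normal(size=STATE_DIM)
    a = theta_old @ s + 0.5 * rng.normal()   # action drawn from behavior policy
    td = rng.normal()                        # stand-in for a critic's TD error
    buffer.append((s, a, td, log_prob(theta_old, s, a)))

# ... and replay it several times, amortizing one batch of interaction
# over many updates instead of collecting fresh experience each step.
for _ in range(10):
    theta = replay_actor_update(theta, buffer)
```

Truncating the weight at a fixed bound trades variance for bias: the untruncated ratio can blow up as the current policy drifts from the behavior policy, while the truncated estimator's bias is, per the paper's analysis, bounded and asymptotically vanishing.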