Policy oscillation is overshooting

Authors: Paul Wagner
Affiliations: -
Venue: Neural Networks
Year: 2014

Citing 21
Cited 0

Feature-based methods for large scale dynamic programming. Machine Learning - Special issue on reinforcement learning.
Natural gradient works efficiently in learning. Neural Computation.
On the existence of fixed points for approximate value iteration and temporal-difference learning. Journal of Optimization Theory and Applications.
Kernel-Based Reinforcement Learning. Machine Learning.
Technical Update: Least-Squares Temporal Difference Learning. Machine Learning.
Least Squares Policy Evaluation Algorithms with Linear Function Approximation. Discrete Event Dynamic Systems.
An Analysis of Direct Reinforcement Learning in Non-Markovian Domains. ICML '98 Proceedings of the Fifteenth International Conference on Machine Learning.
On the Existence of Fixed Points for Q-Learning and Sarsa in Partially Observable Domains. ICML '02 Proceedings of the Nineteenth International Conference on Machine Learning.
Reinforcement learning for POMDPs based on action values and stochastic optimization. Eighteenth national conference on Artificial intelligence.
On Actor-Critic Algorithms. SIAM Journal on Control and Optimization.
Least-squares policy iteration. The Journal of Machine Learning Research.
Learning tetris using the noisy cross-entropy method. Neural Computation.
Machine learning of motor skills for robotics.
Natural Actor-Critic. Neurocomputing.
An analysis of reinforcement learning with function approximation. Proceedings of the 25th international conference on Machine learning.
An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning. Proceedings of the 25th international conference on Machine learning.
State-Dependent Exploration for Policy Gradient Methods. ECML PKDD '08 Proceedings of the European conference on Machine Learning and Knowledge Discovery in Databases - Part II.
Natural actor-critic algorithms. Automatica (Journal of IFAC).
2010 Special Issue: Parameter-exploring policy gradients. Neural Networks.
Algorithms for Reinforcement Learning.
Reinforcement Learning and Dynamic Programming Using Function Approximators.

Quantified Score

Hi-index	0.00

Abstract

A majority of approximate dynamic programming approaches to the reinforcement learning problem can be categorized into greedy value function methods and value-based policy gradient methods. The former approach, although fast, is well known to be susceptible to the policy oscillation phenomenon. We take a fresh view of this phenomenon by casting, within the context of non-optimistic policy iteration, a considerable subset of the former approach as a limiting special case of the latter. We explain the phenomenon in terms of this view and illustrate the underlying mechanism with artificial examples. We also use it to derive the constrained natural actor-critic algorithm, which can interpolate between the aforementioned approaches. In addition, it has been suggested in the literature that the oscillation phenomenon might be subtly connected to the grossly suboptimal performance of all attempted approximate dynamic programming methods on the Tetris benchmark problem. Based on empirical findings, we offer a hypothesis that might explain the inferior performance levels and the associated policy degradation phenomenon, and which would partially support the suggested connection. Finally, we report scores in the Tetris problem that improve on existing dynamic-programming-based results by an order of magnitude.
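
The idea of a greedy method arising as a limiting case of a smoother, parameterized policy can be pictured with a small sketch. The code below is only an illustration under stated assumptions, not the paper's construction: it uses a Gibbs (softmax) policy over hypothetical action values for a single state and shows that, as an assumed temperature parameter shrinks, the action probabilities approach those of the greedy policy. The names `softmax_policy`, `greedy_policy`, `q_values` and `temperature` are illustrative and do not come from the paper.

```python
import math


def softmax_policy(action_values, temperature):
    """Gibbs (softmax) action probabilities for one state."""
    # Subtract the maximum before exponentiating for numerical stability.
    best = max(action_values)
    weights = [math.exp((q - best) / temperature) for q in action_values]
    total = sum(weights)
    return [w / total for w in weights]


def greedy_policy(action_values):
    """Deterministic greedy policy: all probability on the first maximizing action."""
    best_index = max(range(len(action_values)), key=lambda i: action_values[i])
    return [1.0 if i == best_index else 0.0 for i in range(len(action_values))]


# Hypothetical action values for a single state (illustrative only).
q_values = [1.0, 1.2, 0.7]

# Lower temperatures concentrate probability on the highest-valued action.
for temperature in (1.0, 0.1, 0.01):
    probs = softmax_policy(q_values, temperature)
    print(temperature, [round(p, 3) for p in probs])

print("greedy", greedy_policy(q_values))
```

A small change in the action values near a tie can flip the greedy choice entirely, whereas the softmax probabilities at a moderate temperature shift only gradually. This is one informal way to read the title's framing of oscillation as overshooting.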