Q-Learning and Enhanced Policy Iteration in Discounted Dynamic Programming

Authors:
Dimitri P. Bertsekas; Huizhen Yu
Affiliations:
Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139; Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139
Venue:
Mathematics of Operations Research
Year:
2012

Abstract

We consider the classical finite-state discounted Markovian decision problem, and we introduce a new policy iteration-like algorithm for finding the optimal state costs or Q-factors. The main difference is in the policy evaluation phase: instead of solving a linear system of equations, our algorithm requires solving an optimal stopping problem. The solution of this problem may be inexact, with a finite number of value iterations, in the spirit of modified policy iteration. The stopping problem structure is incorporated into the standard Q-learning algorithm to obtain a new method that is intermediate between policy iteration and Q-learning/value iteration. Thanks to its special contraction properties, our method overcomes some of the traditional convergence difficulties of modified policy iteration and admits asynchronous deterministic and stochastic iterative implementations, with lower overhead and/or more reliable convergence than existing Q-learning schemes. Furthermore, for large-scale problems, where linear basis function approximations and simulation-based temporal difference implementations are used, our algorithm effectively addresses the inherent difficulties of approximate policy iteration due to inadequate exploration of the state and control spaces.
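
The abstract gives no formulas, so the following is only a minimal sketch of how the policy evaluation phase described above might look on a toy problem. It rests on several assumptions that are not stated in the abstract: the policy is deterministic and greedy; the stopping problem uses the current cost estimates J as stopping costs; and each evaluation sweep replaces Q(i,u) by the expected one-stage cost plus alpha times min{J(j), Q(j, mu(j))} averaged over next states j. The names P, g, alpha, sweeps, and the two-state example are all hypothetical.

```python
# Toy 2-state, 2-control discounted MDP (hypothetical data).
# P[i][u][j]: transition probability, g[i][u]: expected one-stage cost.
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.1, 0.9]]]
g = [[1.0, 2.0], [0.5, 3.0]]
alpha = 0.9  # discount factor


def greedy(Q):
    """Control with the smallest Q-factor at each state."""
    return [min(range(len(row)), key=row.__getitem__) for row in Q]


def stopping_sweeps(P, g, alpha, J, mu, Q, sweeps):
    """Inexact policy evaluation: a finite number of value iterations
    for the stopping problem with stopping costs J (assumed form)."""
    n, m = len(P), len(P[0])
    for _ in range(sweeps):
        # At next state j, either stop at cost J[j] or continue with Q[j][mu[j]].
        Q = [[g[i][u] + alpha * sum(P[i][u][j] * min(J[j], Q[j][mu[j]])
                                    for j in range(n))
              for u in range(m)]
             for i in range(n)]
    return Q


def enhanced_policy_iteration(P, g, alpha, sweeps=3, iters=300):
    """Alternate stopping-problem sweeps with greedy policy updates."""
    n, m = len(P), len(P[0])
    Q = [[0.0] * m for _ in range(n)]
    J = [0.0] * n
    for _ in range(iters):
        mu = greedy(Q)                      # policy used for evaluation
        Q = stopping_sweeps(P, g, alpha, J, mu, Q, sweeps)
        J = [min(row) for row in Q]         # updated stopping costs
    return Q, J, greedy(Q)


def value_iteration(P, g, alpha, iters=2000):
    """Plain Q-factor value iteration, used only as a reference."""
    n, m = len(P), len(P[0])
    Q = [[0.0] * m for _ in range(n)]
    for _ in range(iters):
        J = [min(row) for row in Q]
        Q = [[g[i][u] + alpha * sum(P[i][u][j] * J[j] for j in range(n))
              for u in range(m)]
             for i in range(n)]
    return Q
```

Calling `enhanced_policy_iteration(P, g, alpha)` and `value_iteration(P, g, alpha)` on this toy problem lets one check that the two sets of Q-factors agree. With `sweeps=1` the evaluation step is closest to value iteration, while larger values of `sweeps` move it toward a fuller evaluation of the current policy, which is one way to read the abstract's description of a method intermediate between the two.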