We present an online simulation-based algorithm called Approximate Stochastic Annealing (ASA) for solving infinite-horizon Markov decision processes (MDPs) with finite state and action spaces. The algorithm estimates the optimal policy by sampling, at each iteration, from a probability distribution over the policy space; this distribution is updated iteratively using Q-function estimates obtained via a Q-learning-type recursion. By exploiting a novel connection between ASA and the stochastic approximation method, we show that the sequence of distributions generated by the algorithm converges to a degenerate distribution concentrated on the optimal policy. Numerical examples are also provided to illustrate the algorithm.
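To make the loop structure concrete, the following is a minimal sketch of an ASA-style iteration on a toy two-state MDP. The MDP itself, the step sizes, and the product-form (per-state) policy distribution are illustrative assumptions for this sketch, not the paper's exact construction: each step samples an action from the current distribution, applies a Q-learning-type update to the Q-function estimate, and shifts the distribution toward the policy that is greedy with respect to the current estimates.

```python
import random

GAMMA = 0.9     # discount factor
ALPHA = 0.1     # Q-learning step size
BETA = 0.005    # step size for the policy-distribution update
STATES, ACTIONS = 2, 2

def step(s, a):
    """Toy MDP (assumed for illustration): action 1 yields reward 1,
    and the chosen action determines the next state."""
    return a, float(a)  # (next_state, reward)

def sample_action(theta, s, rng):
    """Sample an action from the current per-state distribution."""
    return 0 if rng.random() < theta[s][0] else 1

rng = random.Random(0)
Q = [[0.0] * ACTIONS for _ in range(STATES)]
# theta[s][a]: marginal probability of action a in state s, i.e. a
# product-form distribution over deterministic policies, initially uniform.
theta = [[0.5, 0.5] for _ in range(STATES)]

s = 0
for _ in range(20000):
    a = sample_action(theta, s, rng)          # sample from the distribution
    s_next, r = step(s, a)
    # Q-learning-type recursion for the Q-function estimate
    Q[s][a] += ALPHA * (r + GAMMA * max(Q[s_next]) - Q[s][a])
    # Shift the distribution toward the policy that is greedy w.r.t. Q
    greedy = max(range(ACTIONS), key=lambda b: Q[s][b])
    for b in range(ACTIONS):
        target = 1.0 if b == greedy else 0.0
        theta[s][b] += BETA * (target - theta[s][b])
    s = s_next

# Greedy policy recovered from the learned Q-function estimates;
# in this toy MDP the optimal policy takes action 1 in every state.
greedy_policy = [max(range(ACTIONS), key=lambda b: Q[s][b])
                 for s in range(STATES)]
print(greedy_policy)
print([round(theta[s][1], 2) for s in range(STATES)])
```

In this sketch the distribution `theta` plays the role of the iteratively updated sampling distribution, and its drift toward the greedy policy mirrors the concentration on the optimal policy established in the convergence analysis.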