Hierarchical control and learning for Markov decision processes
We consider the problem of control of hierarchical Markov decision processes and develop a simulation-based, two-timescale actor-critic algorithm in a general framework. We also develop certain approximation algorithms that require less computation and satisfy a performance bound. One of the approximation algorithms is a three-timescale actor-critic algorithm, while the other is a two-timescale algorithm that, however, operates in two separate stages. All our algorithms recursively update randomized policies using the simultaneous perturbation stochastic approximation (SPSA) methodology. We briefly present the convergence analysis of our algorithms. We then present numerical experiments on a problem of production planning in semiconductor fabs, on which we compare the performance of all the algorithms together with policy iteration. Algorithms based on certain Hadamard-matrix-based deterministic perturbations are found to show the best results.
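To make the SPSA recursion mentioned above concrete, the following Python fragment is a minimal illustrative sketch, not the authors' implementation: it shows a generic two-measurement SPSA update with random Bernoulli perturbations, together with the normalized-Hadamard-matrix construction that underlies deterministic perturbation sequences of the kind the abstract refers to. The objective f, the step sizes a and c, and the problem dimension are all assumptions made for the example.

```python
# Illustrative sketch only; not the authors' implementation. The
# objective f, step sizes a and c, and the dimension are assumptions.
import numpy as np

def spsa_step(theta, f, a=0.01, c=0.1, delta=None, rng=None):
    """One two-measurement SPSA update: perturb every coordinate at once
    and estimate the gradient from two noisy evaluations of f."""
    rng = np.random.default_rng() if rng is None else rng
    if delta is None:
        # Random perturbations: i.i.d. symmetric Bernoulli (+/-1) entries.
        delta = rng.choice([-1.0, 1.0], size=theta.shape)
    g_hat = (f(theta + c * delta) - f(theta - c * delta)) / (2.0 * c * delta)
    return theta - a * g_hat  # gradient-descent step on the estimate

def hadamard_perturbations(p):
    """Rows of a normalized Hadamard matrix (all-ones first column
    dropped): cycling through them gives a deterministic perturbation
    sequence of the kind referred to in the abstract."""
    n, H = 1, np.array([[1.0]])
    while n < p + 1:
        H = np.block([[H, H], [H, -H]])  # Sylvester construction
        n *= 2
    return H[:, 1:p + 1]

if __name__ == "__main__":
    # Usage: minimize a noisy quadratic, cycling deterministic perturbations.
    rng = np.random.default_rng(0)
    f = lambda x: float(np.sum(x ** 2)) + 0.01 * rng.standard_normal()
    theta, D = np.ones(4), hadamard_perturbations(4)
    for k in range(2000):
        theta = spsa_step(theta, f, delta=D[k % len(D)])
    print(theta)  # should end up near the origin
```

Passing a row of hadamard_perturbations as delta replaces the random Bernoulli vector with a deterministic one; cycling through the rows averages out the cross-coordinate bias in the gradient estimate, which is the role such perturbations play in the algorithms compared in the experiments.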