Online expectation maximization for reinforcement learning in POMDPs

  • Authors:
  • Miao Liu; Xuejun Liao; Lawrence Carin

  • Affiliations:
  • Duke University, Durham, NC (all authors)

  • Venue:
  • IJCAI '13: Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence
  • Year:
  • 2013

Abstract

We present online nested expectation maximization for model-free reinforcement learning in a POMDP. The algorithm evaluates the policy only on the current learning episode, discarding the episode after the evaluation and retaining only the sufficient statistic, from which the policy is computed in closed form. As a result, the online algorithm has time complexity O(n) and memory complexity O(1), compared to O(n²) and O(n) for the corresponding batch-mode algorithm, where n is the number of learning episodes. The online algorithm, which has provable convergence, is demonstrated on five benchmark POMDP problems.
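The complexity claim follows from the update pattern the abstract describes: each episode contributes its expected counts to a running sufficient statistic through a decaying step size, the episode is then discarded, and the policy is renormalized in closed form. The sketch below illustrates that generic online-EM update for a finite-state-controller policy in Python; the variable names, dimensions, and step-size schedule are assumptions for illustration, not the authors' exact algorithm.

```python
import numpy as np

# Hypothetical dimensions for a finite-state-controller policy (illustrative only).
N_NODES, N_ACTIONS, N_OBS = 4, 3, 5

# Running sufficient statistics: expected action counts per controller node and
# expected node-transition counts per observation. Memory stays O(1) in the
# number of episodes n, because each episode is folded in and then discarded.
S_action = np.ones((N_NODES, N_ACTIONS))       # pseudo-counts for the action policy
S_trans = np.ones((N_OBS, N_NODES, N_NODES))   # pseudo-counts for node transitions

def online_em_update(episode_stats, t, S_action, S_trans):
    """Fold one episode's expected counts into the running statistics with a
    decaying step size, then recompute the policy in closed form (M-step)."""
    gamma = 1.0 / (t + 1) ** 0.6  # assumed Robbins-Monro step-size schedule
    s_a, s_t = episode_stats
    S_action = (1 - gamma) * S_action + gamma * s_a
    S_trans = (1 - gamma) * S_trans + gamma * s_t
    # Closed-form M-step: normalize counts into stochastic policy parameters.
    pi_action = S_action / S_action.sum(axis=1, keepdims=True)
    pi_trans = S_trans / S_trans.sum(axis=2, keepdims=True)
    return S_action, S_trans, pi_action, pi_trans

# Example: fold in episode t = 0 with uniform dummy counts.
stats = (np.ones((N_NODES, N_ACTIONS)), np.ones((N_OBS, N_NODES, N_NODES)))
S_action, S_trans, pi_a, pi_z = online_em_update(stats, 0, S_action, S_trans)
```

Each update touches only fixed-size arrays, so the per-episode cost is constant and the total cost over n episodes is O(n), matching the complexity stated in the abstract.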