Online expectation maximization for reinforcement learning in POMDPs

  • Authors:
  • Miao Liu; Xuejun Liao; Lawrence Carin

  • Affiliations:
  • Duke University, Durham, NC (all authors)

  • Venue:
  • IJCAI '13: Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence
  • Year:
  • 2013

Abstract

We present online nested expectation maximization for model-free reinforcement learning in a POMDP. The algorithm evaluates the policy only on the current learning episode, discarding the episode after the evaluation and retaining only the sufficient statistic, from which the policy is computed in closed form. As a result, the online algorithm has time complexity O(n) and memory complexity O(1), compared to O(n²) and O(n) for the corresponding batch-mode algorithm, where n is the number of learning episodes. The online algorithm, which has provable convergence, is demonstrated on five benchmark POMDP problems.
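The complexity claim follows from the update pattern the abstract describes: each episode contributes its expected counts to a running sufficient statistic through a decaying step size, the episode is then discarded, and the policy is renormalized in closed form. The sketch below illustrates that generic online-EM update for a finite-state-controller policy in Python; the variable names, dimensions, and step-size schedule are assumptions for illustration, not the authors' exact algorithm.

```python
import numpy as np

# Hypothetical dimensions for a finite-state-controller policy (illustrative only).
N_NODES, N_ACTIONS, N_OBS = 4, 3, 5

# Running sufficient statistics: expected action counts per controller node and
# expected node-transition counts per observation. Memory stays O(1) in the
# number of episodes n, because each episode is folded in and then discarded.
S_action = np.ones((N_NODES, N_ACTIONS))       # pseudo-counts for the action policy
S_trans = np.ones((N_OBS, N_NODES, N_NODES))   # pseudo-counts for node transitions

def online_em_update(episode_stats, t, S_action, S_trans):
    """Fold one episode's expected counts into the running statistics with a
    decaying step size, then recompute the policy in closed form (M-step)."""
    gamma = 1.0 / (t + 1) ** 0.6  # assumed Robbins-Monro step-size schedule
    s_a, s_t = episode_stats
    S_action = (1 - gamma) * S_action + gamma * s_a
    S_trans = (1 - gamma) * S_trans + gamma * s_t
    # Closed-form M-step: normalize counts into stochastic policy parameters.
    pi_action = S_action / S_action.sum(axis=1, keepdims=True)
    pi_trans = S_trans / S_trans.sum(axis=2, keepdims=True)
    return S_action, S_trans, pi_action, pi_trans

# Example: fold in episode t = 0 with uniform dummy counts.
stats = (np.ones((N_NODES, N_ACTIONS)), np.ones((N_OBS, N_NODES, N_NODES)))
S_action, S_trans, pi_a, pi_z = online_em_update(stats, 0, S_action, S_trans)
```

Each update touches only fixed-size arrays, so the per-episode cost is constant and the total cost over n episodes is O(n), matching the complexity stated in the abstract.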