New inference strategies for solving Markov decision processes using reversible jump MCMC

Authors:
Matt Hoffman;Hendrik Kueck;Nando de Freitas;Arnaud Doucet
Affiliations:
University of British Columbia, Vancouver, BC, Canada;University of British Columbia, Vancouver, BC, Canada;University of British Columbia, Vancouver, BC, Canada;University of British Columbia, Vancouver, BC, Canada
Venue:
UAI '09 Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence
Year:
2009

Citing 7
Cited 1

Using expectation-maximization for reinforcement learning

Neural Computation
Marginal maximum a posteriori estimation using Markov chain Monte Carlo

Statistics and Computing
Mean Shift, Mode Seeking, and Clustering

IEEE Transactions on Pattern Analysis and Machine Intelligence
PEGASUS: A policy search method for large MDPs and POMDPs

UAI '00 Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence
Introduction to Stochastic Search and Optimization

Introduction to Stochastic Search and Optimization
Probabilistic inference for solving discrete and continuous state Markov Decision Processes

ICML '06 Proceedings of the 23rd international conference on Machine learning
Planning and Moving in Dynamic Environments

Creating Brain-Like Intelligence

Analyzing and escaping local optima in planning as inference for partially observable domains

ECML PKDD'11 Proceedings of the 2011 European conference on Machine learning and knowledge discovery in databases - Volume Part II

Quantified Score

Hi-index	0.00

Visualization

Abstract

In this paper we build on previous work which uses inferences techniques, in particular Markov Chain Monte Carlo (MCMC) methods, to solve parameterized control problems. We propose a number of modifications in order to make this approach more practical in general, higher-dimensional spaces. We first introduce a new target distribution which is able to incorporate more reward information from sampled trajectories. We also show how to break strong correlations between the policy parameters and sampled trajectories in order to sample more freely. Finally, we show how to incorporate these techniques in a principled manner to obtain estimates of the optimal policy.