We are interested in the problem of determining a course of action to achieve a desired objective in a non-deterministic environment. Markov decision processes (MDPs) provide a framework for representing this action selection problem, and a number of algorithms learn optimal policies within this formulation. The framework has also been used to study state space abstraction, problem decomposition, and policy reuse, techniques that sacrifice optimality of the solution for improved learning speed. In this paper we examine the suboptimality incurred by reusing policies that are solutions to subproblems. We work within a restricted class of MDPs, namely those in which non-zero reward is received only upon reaching a goal state. We define subproblems within this class and motivate how reusing their solutions can speed up learning. The contribution of this paper is the derivation of a tight bound on the loss in optimality caused by this reuse. We first examine a bound based on Bellman error, which applies to all MDPs but is not tight enough to be useful, and then contribute our own theoretical result, which gives an empirically tight bound on this suboptimality.
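The abstract mentions a generic suboptimality bound based on Bellman error. A minimal sketch of that standard bound, on a tiny goal-reward MDP of the restricted class described (non-zero reward only upon reaching the goal state): for any approximate value function V̂, ‖V̂ − V*‖∞ ≤ ‖TV̂ − V̂‖∞ / (1 − γ), where T is the Bellman optimality operator. The 4-state chain, transition probabilities, and discount factor below are illustrative assumptions, not taken from the paper, and this is the textbook Bellman-residual bound, not the paper's tighter result.

```python
import numpy as np

gamma = 0.9
n_states, n_actions = 4, 2            # states 0..3; state 3 is the goal
# P[a, s, s'] : transition probabilities (illustrative chain, not from the paper)
P = np.zeros((n_actions, n_states, n_states))
for s in range(n_states - 1):
    P[0, s, s + 1] = 0.8              # action 0 ("right") usually advances
    P[0, s, s] = 0.2
    P[1, s, max(s - 1, 0)] = 0.8      # action 1 ("left") usually retreats
    P[1, s, s] = 0.2
P[:, n_states - 1, n_states - 1] = 1.0  # goal state is absorbing

# Reward only on reaching the goal: expected reward = P(transition into goal)
R = np.zeros((n_actions, n_states))
for a in range(n_actions):
    for s in range(n_states - 1):
        R[a, s] = P[a, s, n_states - 1]

def bellman_backup(V):
    # (TV)(s) = max_a [ R[a, s] + gamma * sum_s' P[a, s, s'] V(s') ]
    Q = R + gamma * np.einsum('ast,t->as', P, V)
    return Q.max(axis=0)

# Value iteration to (near-)optimal V*
V = np.zeros(n_states)
for _ in range(2000):
    V_new = bellman_backup(V)
    if np.max(np.abs(V_new - V)) < 1e-12:
        V = V_new
        break
    V = V_new

# Bellman-error bound for an arbitrary approximate value function V_hat:
# ||V_hat - V*||_inf <= ||T V_hat - V_hat||_inf / (1 - gamma).
V_hat = V + 0.05 * np.array([1.0, -1.0, 1.0, 0.0])   # perturbed estimate
residual = np.max(np.abs(bellman_backup(V_hat) - V_hat))
bound = residual / (1.0 - gamma)                     # loose for gamma near 1
true_gap = np.max(np.abs(V_hat - V))
```

The `1 / (1 - gamma)` factor is what makes this bound loose in practice: for discount factors near 1 it blows up even when the residual is small, which is consistent with the abstract's remark that the Bellman-error bound is not tight enough to be useful.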