Recent work has defined an optimal reward problem (ORP) in which an agent designer, equipped with an objective reward function that evaluates an agent's behavior, chooses what reward function to build into a learning or planning agent to guide that behavior. Existing results on the ORP demonstrate weak mitigation of limited computational resources: there exist reward functions such that agents guided by them outperform agents guided directly by the objective reward function. These results, however, ignore the cost of finding such good reward functions. We define a nested optimal reward and control architecture that achieves strong mitigation of limited computational resources. We show empirically that the designer is better off using this architecture, which spends some of its limited resources learning a good reward function, than spending all of its resources optimizing the agent's behavior with respect to the objective reward function.
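To make the nested structure concrete, below is a minimal sketch of how such an architecture might be wired together, not the paper's implementation. Everything in it is an illustrative assumption: a hypothetical chain environment, three hand-picked candidate internal reward functions, a compute-limited Q-learner as the inner control loop, and an epsilon-greedy bandit as the outer loop that learns which internal reward yields the most objective reward.

```python
# Hedged sketch of a nested optimal reward and control loop.
# All names and choices here (chain environment, candidate rewards,
# Q-learning inner agent, bandit outer loop) are hypothetical
# illustrations of the idea, not the authors' architecture.
import random

# Hypothetical chain environment: states 0..N-1, actions left/right,
# objective reward only at the far-right goal state.
N = 10
GOAL = N - 1

def step(state, action):
    """action 0 = left, 1 = right; returns (next_state, objective_reward)."""
    nxt = max(0, min(N - 1, state + (1 if action == 1 else -1)))
    return nxt, (1.0 if nxt == GOAL else 0.0)

def objective_reward(state):
    return 1.0 if state == GOAL else 0.0

# Candidate internal reward functions the outer loop can choose among.
# (In the ORP this would be a richer space of reward functions.)
candidate_rewards = [
    lambda s: objective_reward(s),             # the objective reward itself
    lambda s: objective_reward(s) + 0.01 * s,  # mild progress bonus
    lambda s: objective_reward(s) + 0.1 * s,   # strong progress bonus
]

def inner_agent_return(internal_reward, episodes=20, horizon=30):
    """Inner loop: a compute-limited Q-learner trained on the *internal*
    reward; returns the *objective* return it actually achieves."""
    Q = [[0.0, 0.0] for _ in range(N)]
    total_objective = 0.0
    for _ in range(episodes):
        s = 0
        for _ in range(horizon):
            if random.random() < 0.1:
                a = random.randrange(2)
            else:
                a = max((0, 1), key=lambda act: Q[s][act])
            nxt, obj_r = step(s, a)
            total_objective += obj_r
            # TD update uses the internal reward, not the objective one.
            target = internal_reward(nxt) + 0.9 * max(Q[nxt])
            Q[s][a] += 0.5 * (target - Q[s][a])
            s = nxt
    return total_objective

# Outer loop: epsilon-greedy bandit over candidate internal rewards,
# scored by the objective return of the inner agent each one induces.
value = [0.0] * len(candidate_rewards)
count = [0] * len(candidate_rewards)
for trial in range(60):
    if random.random() < 0.2:
        i = random.randrange(len(candidate_rewards))
    else:
        i = max(range(len(candidate_rewards)), key=lambda j: value[j])
    g = inner_agent_return(candidate_rewards[i])
    count[i] += 1
    value[i] += (g - value[i]) / count[i]  # incremental mean of objective return

best = max(range(len(candidate_rewards)), key=lambda j: value[j])
print("estimated objective return per candidate:", [round(v, 1) for v in value])
print("selected internal reward index:", best)
```

The point of the sketch is the division of the resource budget: every outer-loop trial spends inner-loop compute on control, yet the designer still comes out ahead when a shaped internal reward lets the short-budget learner reach the goal more often than learning directly from the sparse objective reward would.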