Recent work has defined an optimal reward problem (ORP) in which an agent designer, equipped with an objective reward function that evaluates an agent's behavior, chooses what reward function to build into a learning or planning agent to guide that behavior. Existing results on the ORP demonstrate weak mitigation of limited computational resources: there exist reward functions such that agents guided by them outperform agents guided directly by the objective reward function. These results, however, ignore the cost of finding such good reward functions. We define a nested optimal reward and control architecture that achieves strong mitigation of limited computational resources. We show empirically that the designer is better off using this architecture, which spends some of its limited resources learning a good reward function, than spending all of its resources optimizing the agent's behavior with respect to the objective reward function.
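To make the nested structure concrete, below is a minimal sketch of how such an architecture might be wired together, not the paper's implementation. Everything in it is an illustrative assumption: a hypothetical chain environment, three hand-picked candidate internal reward functions, a compute-limited Q-learner as the inner control loop, and an epsilon-greedy bandit as the outer loop that learns which internal reward yields the most objective reward.

```python
# Hedged sketch of a nested optimal reward and control loop.
# All names and choices here (chain environment, candidate rewards,
# Q-learning inner agent, bandit outer loop) are hypothetical
# illustrations of the idea, not the authors' architecture.
import random

# Hypothetical chain environment: states 0..N-1, actions left/right,
# objective reward only at the far-right goal state.
N = 10
GOAL = N - 1

def step(state, action):
    """action 0 = left, 1 = right; returns (next_state, objective_reward)."""
    nxt = max(0, min(N - 1, state + (1 if action == 1 else -1)))
    return nxt, (1.0 if nxt == GOAL else 0.0)

def objective_reward(state):
    return 1.0 if state == GOAL else 0.0

# Candidate internal reward functions the outer loop can choose among.
# (In the ORP this would be a richer space of reward functions.)
candidate_rewards = [
    lambda s: objective_reward(s),             # the objective reward itself
    lambda s: objective_reward(s) + 0.01 * s,  # mild progress bonus
    lambda s: objective_reward(s) + 0.1 * s,   # strong progress bonus
]

def inner_agent_return(internal_reward, episodes=20, horizon=30):
    """Inner loop: a compute-limited Q-learner trained on the *internal*
    reward; returns the *objective* return it actually achieves."""
    Q = [[0.0, 0.0] for _ in range(N)]
    total_objective = 0.0
    for _ in range(episodes):
        s = 0
        for _ in range(horizon):
            if random.random() < 0.1:
                a = random.randrange(2)
            else:
                a = max((0, 1), key=lambda act: Q[s][act])
            nxt, obj_r = step(s, a)
            total_objective += obj_r
            # TD update uses the internal reward, not the objective one.
            target = internal_reward(nxt) + 0.9 * max(Q[nxt])
            Q[s][a] += 0.5 * (target - Q[s][a])
            s = nxt
    return total_objective

# Outer loop: epsilon-greedy bandit over candidate internal rewards,
# scored by the objective return of the inner agent each one induces.
value = [0.0] * len(candidate_rewards)
count = [0] * len(candidate_rewards)
for trial in range(60):
    if random.random() < 0.2:
        i = random.randrange(len(candidate_rewards))
    else:
        i = max(range(len(candidate_rewards)), key=lambda j: value[j])
    g = inner_agent_return(candidate_rewards[i])
    count[i] += 1
    value[i] += (g - value[i]) / count[i]  # incremental mean of objective return

best = max(range(len(candidate_rewards)), key=lambda j: value[j])
print("estimated objective return per candidate:", [round(v, 1) for v in value])
print("selected internal reward index:", best)
```

The point of the sketch is the division of the resource budget: every outer-loop trial spends inner-loop compute on control, yet the designer still comes out ahead when a shaped internal reward lets the short-budget learner reach the goal more often than learning directly from the sparse objective reward would.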