Introduction to Reinforcement Learning.
Cobot in LambdaMOO: An Adaptive Social Statistics Agent. Autonomous Agents and Multi-Agent Systems.
Teachable robots: Understanding human teaching behavior to build more effective robot learners. Artificial Intelligence.
A survey of robot learning from demonstration. Robotics and Autonomous Systems.
Interactively shaping agents via human reinforcement: the TAMER framework. Proceedings of the Fifth International Conference on Knowledge Capture (K-CAP).
Achieving master level play in 9×9 computer Go. Proceedings of the 23rd National Conference on Artificial Intelligence (AAAI'08), Volume 3.
Dynamic reward shaping: training a robot by voice. Proceedings of the 12th Ibero-American Conference on Advances in Artificial Intelligence (IBERAMIA'10).
Bandit based Monte-Carlo planning. Proceedings of the 17th European Conference on Machine Learning (ECML'06).
Teaching a robot to perform task through imitation and on-line feedback. Proceedings of the 16th Iberoamerican Congress on Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications (CIARP'11).
Teaching agents with human feedback: a demonstration of the TAMER framework. Companion Proceedings of the 2013 International Conference on Intelligent User Interfaces (IUI'13).
Recent research has demonstrated that human-generated reward signals can be effectively used to train agents to perform a range of reinforcement learning tasks. Such tasks are either episodic - i.e., conducted in unconnected episodes of activity that often end in either goal or failure states - or continuing - i.e., indefinitely ongoing. Another point of difference is whether the learning agent highly discounts the value of future reward - a myopic agent - or conversely values future reward appreciably. In recent work, we found that previous approaches to learning from human reward have all used myopic valuation [7]. That study additionally provided evidence for the desirability of myopic valuation in task domains that are both goal-based and episodic. In this paper, we conduct three user studies that examine critical assumptions of our previous research: task episodicity, optimality of behavior with respect to a Markov Decision Process, and the absence of a failure state in the goal-based task. In the first experiment, we show that converting a simple episodic task to a non-episodic (i.e., continuing) one resolves some theoretical issues present in episodic tasks with generally positive reward and - relatedly - enables highly successful learning with non-myopic valuation in multiple user studies. The primary learning algorithm in this paper, which we call "VI-TAMER", is the first algorithm to successfully learn non-myopically from human-generated reward; we also empirically show that such non-myopic valuation facilitates a higher-level understanding of the task. Anticipating the complexity of real-world problems, we perform two subsequent user studies - one with a failure state added - that compare (1) learning when states are updated asynchronously with local bias - i.e., states quickly reachable from the agent's current state are updated more often than other states - against (2) learning with the fully synchronous sweeps over all states performed by the VI-TAMER algorithm. With these locally biased updates, we find that the general positivity of human reward creates problems even for continuing tasks, revealing a distinct research challenge for future work.
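To make the two update regimes in the abstract above concrete, the Python sketch below contrasts full synchronous value-iteration sweeps over a learned model of human reward (the idea behind VI-TAMER) with asynchronous updates biased toward states near the agent. This is an illustrative reconstruction under assumed structures, not the paper's implementation: the tabular arrays reward_hat and next_state, the discount gamma, the local_prob sampling bias, and the horizon neighborhood are all hypothetical choices made for the sketch.

import numpy as np

def vi_tamer_sweeps(reward_hat, next_state, gamma=0.99, n_sweeps=200):
    # Synchronous value iteration over a learned model of human reward.
    # reward_hat: (S, A) array predicting the trainer's reward for each
    # state-action pair (a TAMER-style model); next_state: (S, A) integer
    # array giving the deterministic successor of each state-action pair.
    # gamma near 1 makes the agent non-myopic: future reward matters.
    S, A = reward_hat.shape
    V = np.zeros(S)
    for _ in range(n_sweeps):
        # Full sweep: every state is backed up from the previous V, so
        # value propagates uniformly across the whole state space.
        Q = reward_hat + gamma * V[next_state]  # shape (S, A)
        V = Q.max(axis=1)
    return V

def locally_biased_updates(reward_hat, next_state, current_state,
                           gamma=0.99, n_updates=2000, horizon=5,
                           local_prob=0.9, seed=0):
    # Comparison condition: asynchronous one-state backups in which
    # states quickly reachable from current_state are updated far more
    # often than the rest of the state space.
    rng = np.random.default_rng(seed)
    S, A = reward_hat.shape
    V = np.zeros(S)
    # Collect the states reachable within `horizon` steps of the agent.
    near, frontier = {current_state}, {current_state}
    for _ in range(horizon):
        frontier = {int(next_state[s, a]) for s in frontier for a in range(A)}
        near |= frontier
    near = np.fromiter(near, dtype=int)
    for _ in range(n_updates):
        # With probability local_prob, back up a nearby state (the bias).
        if rng.random() < local_prob:
            s = int(rng.choice(near))
        else:
            s = int(rng.integers(S))
        V[s] = float((reward_hat[s] + gamma * V[next_state[s]]).max())
    return V

A toy invocation, again purely illustrative: a 10-state ring task where action 0 stays put and action 1 advances, with generally positive modeled reward (the positivity the abstract identifies as problematic under locally biased updates):

S, A = 10, 2
next_state = np.stack([np.arange(S), (np.arange(S) + 1) % S], axis=1)
reward_hat = np.random.default_rng(1).normal(0.5, 0.1, size=(S, A))
V_sync = vi_tamer_sweeps(reward_hat, next_state)
V_local = locally_biased_updates(reward_hat, next_state, current_state=0)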