Sample-based learning and search with permanent and transient memories

Authors:
David Silver; Richard S. Sutton; Martin Müller
Affiliations:
University of Alberta, Edmonton, Alberta (all three authors)
Venue:
Proceedings of the 25th International Conference on Machine Learning
Year:
2008



Abstract

We present a reinforcement learning architecture, Dyna-2, that encompasses both sample-based learning and sample-based search, and that generalises across states during both learning and search. We apply Dyna-2 to high-performance Computer Go. In this domain, the most successful planning methods are based on sample-based search algorithms, such as UCT, in which states are treated individually, and the most successful learning methods are based on temporal-difference learning algorithms, such as Sarsa, in which linear function approximation is used. In both cases an estimate of the value function is formed. In the first case the estimate is transient: it is computed and then discarded after each move. In the second case it is more permanent, slowly accumulating over many moves and games. The idea of Dyna-2 is for the transient planning memory and the permanent learning memory to remain separate, but for both to be based on linear function approximation and both to be updated by Sarsa. To apply Dyna-2 to 9x9 Computer Go, we use a million binary features in the function approximator, based on templates matching small fragments of the board. Using only the transient memory, Dyna-2 performed at least as well as UCT. Using both memories combined, it significantly outperformed UCT. Our program based on Dyna-2 achieved a higher rating on the Computer Go Online Server than any handcrafted or traditional search-based program.
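
The two-memory idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The class name `TwoMemorySarsa`, its method names, the step sizes, and the choice to sum the two linear estimates when the memories are combined are all assumptions made for illustration. States are represented only by the indices of their active binary features.

```python
class TwoMemorySarsa:
    """Hypothetical sketch: permanent and transient linear value estimates.

    Both memories are weight vectors over the same binary features and
    both are updated with a Sarsa-style temporal-difference rule.
    """

    def __init__(self, n_features, alpha=0.1, alpha_search=0.1, gamma=1.0):
        self.theta = [0.0] * n_features      # permanent memory (learning)
        self.theta_bar = [0.0] * n_features  # transient memory (search)
        self.alpha = alpha                   # assumed step size for learning
        self.alpha_search = alpha_search     # assumed step size for search
        self.gamma = gamma                   # discount factor

    def q_permanent(self, active):
        # Linear value with binary features: sum of the active weights.
        return sum(self.theta[i] for i in active)

    def q_combined(self, active):
        # Assumption: the combined estimate is the sum of both memories.
        return self.q_permanent(active) + sum(self.theta_bar[i] for i in active)

    def learn(self, active, reward, next_active, terminal):
        # Sarsa update of the permanent memory from real experience.
        target = reward
        if not terminal:
            target += self.gamma * self.q_permanent(next_active)
        delta = target - self.q_permanent(active)
        for i in active:
            self.theta[i] += self.alpha * delta

    def search_update(self, active, reward, next_active, terminal):
        # Sarsa update of the transient memory from simulated experience.
        # The permanent weights are left untouched.
        target = reward
        if not terminal:
            target += self.gamma * self.q_combined(next_active)
        delta = target - self.q_combined(active)
        for i in active:
            self.theta_bar[i] += self.alpha_search * delta

    def clear_transient(self):
        # Discard the transient memory, for example after each real move.
        self.theta_bar = [0.0] * len(self.theta_bar)
```

In this sketch, `learn` would be called on transitions from real games and `search_update` on transitions from simulated games started at the current position, with `clear_transient` called once the move has been played. Keeping the two weight vectors separate means the search can correct the long-term estimate locally without overwriting it.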