We consider the problem of using a heuristic policy to improve the value approximation computed by the Upper Confidence Bound applied to Trees (UCT) algorithm in non-adversarial settings, such as planning in Markov Decision Processes with large state spaces. Current improvements to UCT focus on changing either the action-selection formula at the internal nodes or the rollout policy at the leaf nodes of the search tree. In this work, we propose adding an auxiliary arm to each internal node and always using the heuristic policy to roll out simulations from the auxiliary arms. The method aims for fast convergence to the optimal values at states where the heuristic policy is optimal, while retaining approximation quality similar to that of the original UCT at other states. We show that the resulting algorithm, UCT-Aux, which bootstraps from the heuristic policy in this way, outperforms the original UCT algorithm and its variants in two benchmark experimental settings. We also examine the conditions under which UCT-Aux works well.
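To make the auxiliary-arm mechanism concrete, the sketch below illustrates one possible reading of it under explicit assumptions: a generic `mdp` object exposing `step(state, action)` and `actions(state)`, a `heuristic_policy` callable, and plain UCB1 selection over the regular arms plus one auxiliary arm whose value is always estimated by rolling out the heuristic policy. These names, the interface, and the simplified expansion handling are illustrative placeholders, not the paper's actual pseudocode.

```python
import math


class Node:
    """A UCT search-tree node with one extra 'auxiliary' arm.

    Regular arms correspond to MDP actions; the auxiliary arm is a
    pseudo-action whose value is always estimated by rolling out the
    fixed heuristic policy from this node's state.
    """

    def __init__(self, state, actions):
        self.state = state
        self.arms = list(actions) + ["AUX"]        # auxiliary arm appended
        self.counts = {a: 0 for a in self.arms}
        self.values = {a: 0.0 for a in self.arms}  # running mean returns
        self.children = {}                         # action -> child Node

    def select_arm(self, c=1.0):
        """UCB1 selection over regular arms and the auxiliary arm."""
        total = sum(self.counts.values())
        best, best_score = None, -float("inf")
        for a in self.arms:
            if self.counts[a] == 0:
                return a                           # visit unvisited arms first
            score = self.values[a] + c * math.sqrt(math.log(total) / self.counts[a])
            if score > best_score:
                best, best_score = a, score
        return best

    def update(self, arm, ret):
        """Incrementally update the mean return estimate of an arm."""
        self.counts[arm] += 1
        self.values[arm] += (ret - self.values[arm]) / self.counts[arm]


def heuristic_rollout(mdp, state, heuristic_policy, gamma, max_depth):
    """Estimate the return from `state` by following the heuristic policy."""
    ret, discount = 0.0, 1.0
    for _ in range(max_depth):
        action = heuristic_policy(state)
        state, reward, done = mdp.step(state, action)
        ret += discount * reward
        discount *= gamma
        if done:
            break
    return ret


def uct_aux_simulate(mdp, node, heuristic_policy, gamma=0.99, max_depth=50, depth=0):
    """One UCT-Aux simulation: descend the tree with UCB1; whenever the
    auxiliary arm is selected, finish the trajectory with the heuristic policy."""
    if depth >= max_depth:
        return 0.0
    arm = node.select_arm()
    if arm == "AUX":
        ret = heuristic_rollout(mdp, node.state, heuristic_policy,
                                gamma, max_depth - depth)
    else:
        next_state, reward, done = mdp.step(node.state, arm)
        if done:
            ret = reward
        else:
            if arm not in node.children:
                node.children[arm] = Node(next_state, mdp.actions(next_state))
            ret = reward + gamma * uct_aux_simulate(
                mdp, node.children[arm], heuristic_policy, gamma, max_depth, depth + 1)
    node.update(arm, ret)
    return ret
```

Under this reading, states where the heuristic policy is near-optimal quickly concentrate visits on the auxiliary arm, while elsewhere the regular arms dominate and the search behaves much like standard UCT; this is only a sketch of the intuition, with the leaf-rollout policy for regular arms and other details of the actual algorithm left out.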