Optimal learning for sequential sampling with non-parametric beliefs
Journal of Global Optimization
We propose a sequential sampling policy for noisy discrete global optimization and ranking and selection, in which we aim to efficiently explore a finite set of alternatives before selecting one as best when exploration stops. Each alternative may be characterized by a multi-dimensional vector of categorical and numerical attributes and has independent normal rewards. We place a Bayesian probability model on the unknown reward of each alternative and follow a fully sequential sampling policy called the knowledge-gradient policy, which myopically maximizes the expected increment in the value of sampling information at each time period. We propose a hierarchical aggregation technique that exploits the features shared by alternatives to learn about many alternatives from even a single measurement. This approach greatly reduces the measurement effort required, although it needs some prior knowledge of the smoothness of the function, in the form of an aggregation function, and computational cost limits the number of alternatives that can easily be handled to a few thousand. We prove that our policy is consistent, finding a globally optimal alternative given enough measurements, and show through simulations that it performs competitively with, or significantly better than, other policies.
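To make the knowledge-gradient idea concrete, the sketch below computes the standard closed-form knowledge-gradient factor for a set of alternatives with independent normal beliefs and a common known measurement-noise variance, and picks the alternative with the largest factor to measure next. This is a minimal illustration of the underlying policy for independent beliefs, not the paper's hierarchical-aggregation version; the function names and the example numbers are illustrative.

```python
import numpy as np
from math import erf, exp, sqrt, pi

def _phi(z):  # standard normal pdf
    return exp(-0.5 * z * z) / sqrt(2.0 * pi)

def _Phi(z):  # standard normal cdf
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def kg_policy(mu, sigma2, noise2):
    """Return (index to measure next, KG factors) under independent
    normal beliefs N(mu[i], sigma2[i]) and a common measurement-noise
    variance noise2. Illustrative sketch of the KG rule for
    independent beliefs (no hierarchical aggregation)."""
    mu = np.asarray(mu, dtype=float)
    sigma2 = np.asarray(sigma2, dtype=float)
    # std dev of the change in the belief about i after one more sample of i
    sigma_tilde = sigma2 / np.sqrt(sigma2 + noise2)
    kg = np.empty(len(mu))
    for i in range(len(mu)):
        # best competing mean: the largest mean among the other alternatives
        other = np.max(np.delete(mu, i))
        z = -abs(mu[i] - other) / sigma_tilde[i]
        # expected one-step increment in the value of the best alternative
        kg[i] = sigma_tilde[i] * (z * _Phi(z) + _phi(z))
    return int(np.argmax(kg)), kg

# Two alternatives with equal means: KG prefers the more uncertain one,
# since measuring it is more likely to change which alternative looks best.
choice, factors = kg_policy([0.0, 0.0], [1.0, 0.01], 1.0)
```

With equal means the tie is broken by uncertainty: the high-variance alternative has a larger `sigma_tilde`, hence a larger expected value of information, so `choice` is 0 here. In the paper's setting this factor is further combined with the hierarchical aggregation estimate, which lets one measurement update beliefs about many similar alternatives at once.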