We examine the problem of evaluating a policy in the contextual bandit setting using only observations collected during the execution of another policy. We show that policy evaluation can be impossible if the exploration policy chooses actions based on the side information provided at each time step. We then propose, and prove the correctness of, a principled method for policy evaluation that works when the exploration policy ignores the side information, even when that policy is deterministic, as long as each action is explored sufficiently often. We apply this general technique to the problem of offline evaluation of internet advertising policies. Although our theoretical results hold only when the exploration policy chooses ads independently of side information, an assumption typically violated by commercial systems, we show how clever uses of the theory yield non-trivial and realistic applications. We also provide an empirical demonstration of the effectiveness of our techniques on real ad placement data.
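To make the evaluation setting concrete, the following is a minimal sketch of one way such an estimator can work, assuming (as the abstract requires) that the exploration policy chose actions independently of the side information. Under that assumption, the empirical frequency of each action in the log can stand in for its (possibly unknown) propensity, even when exploration was deterministic. The function name, data layout, and the synthetic example are illustrative assumptions, not the paper's exact construction.

```python
from collections import Counter


def offline_value_estimate(log, target_policy):
    """Estimate the average per-step reward of target_policy from a log
    gathered by a different exploration policy.

    log: list of (context, action, reward) triples.
    target_policy: function mapping a context to an action.

    Assumption (required for correctness): the exploration policy chose
    actions independently of the context, and each action appears in the
    log, so empirical action counts serve as propensity estimates.
    """
    counts = Counter(action for _, action, _ in log)
    n = len(log)
    total = 0.0
    for context, action, reward in log:
        if target_policy(context) == action:
            # Inverse-propensity weighting with the empirical
            # propensity counts[action] / n.
            total += reward * n / counts[action]
    return total / n
```

For example, with a round-robin (deterministic but context-independent) exploration log where the reward is 1 exactly when the action matches the context, evaluating the policy `lambda x: x` recovers its true value of 1.0, even though that policy was never executed online.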