We examine the problem of evaluating a policy in the contextual bandit setting using only observations collected during the execution of another policy. We show that policy evaluation can be impossible if the exploration policy chooses actions based on the side information provided at each time step. We then propose, and prove the correctness of, a principled method for policy evaluation that works when the exploration policy ignores the side information, even when that policy is deterministic, as long as each action is explored sufficiently often. We apply this general technique to the problem of offline evaluation of internet advertising policies. Although our theoretical results hold only when the exploration policy chooses ads independently of side information, an assumption typically violated by commercial systems, we show how clever uses of the theory yield non-trivial and realistic applications. We also provide an empirical demonstration of the effectiveness of our techniques on real ad placement data.
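To make the evaluation setting concrete, the following is a minimal sketch of one way such an estimator can work, assuming (as the abstract requires) that the exploration policy chose actions independently of the side information. Under that assumption, the empirical frequency of each action in the log can stand in for its (possibly unknown) propensity, even when exploration was deterministic. The function name, data layout, and the synthetic example are illustrative assumptions, not the paper's exact construction.

```python
from collections import Counter


def offline_value_estimate(log, target_policy):
    """Estimate the average per-step reward of target_policy from a log
    gathered by a different exploration policy.

    log: list of (context, action, reward) triples.
    target_policy: function mapping a context to an action.

    Assumption (required for correctness): the exploration policy chose
    actions independently of the context, and each action appears in the
    log, so empirical action counts serve as propensity estimates.
    """
    counts = Counter(action for _, action, _ in log)
    n = len(log)
    total = 0.0
    for context, action, reward in log:
        if target_policy(context) == action:
            # Inverse-propensity weighting with the empirical
            # propensity counts[action] / n.
            total += reward * n / counts[action]
    return total / n
```

For example, with a round-robin (deterministic but context-independent) exploration log where the reward is 1 exactly when the action matches the context, evaluating the policy `lambda x: x` recovers its true value of 1.0, even though that policy was never executed online.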