Learning to trade off between exploration and exploitation in multiclass bandit prediction

Authors:
Hamed Valizadegan; Rong Jin; Shijun Wang
Affiliations:
University of Pittsburgh, Pittsburgh, PA, USA; Michigan State University, East Lansing, MI, USA; National Institute of Health, Bethesda, MD, USA
Venue:
Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2011

References (11):
Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning.
PAC Bounds for Multi-armed Bandit and Markov Decision Processes. COLT '02 Proceedings of the 15th Annual Conference on Computational Learning Theory.
RCV1: A New Benchmark Collection for Text Categorization Research. The Journal of Machine Learning Research.
The Sample Complexity of Exploration in the Multi-Armed Bandit Problem. The Journal of Machine Learning Research.
Prediction, Learning, and Games.
Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems. The Journal of Machine Learning Research.
Efficient bandit algorithms for online multiclass prediction. Proceedings of the 25th international conference on Machine learning.
A contextual-bandit approach to personalized news article recommendation. Proceedings of the 19th international conference on World wide web.
Exploitation and exploration in a performance based contextual advertising system. Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining.
LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST).
Multi-armed bandit algorithms and empirical evaluation. ECML'05 Proceedings of the 16th European conference on Machine Learning.

Cited by (1):
Multiclass classification with bandit feedback using adaptive regularization. Machine Learning.

Abstract

We study multi-class bandit prediction, an online learning problem in which the learner receives only partial feedback in each trial: whether the predicted class label is correct. Trading off exploration against exploitation is a well-known technique for online learning with incomplete feedback (i.e., the bandit setup). Banditron [8], a multi-class online learning algorithm for the bandit setting, maximizes the run-time gain by balancing exploration and exploitation with a fixed tradeoff parameter. The performance of Banditron can be quite sensitive to the choice of this parameter, so effective algorithms that tune it automatically are desirable. In this paper, we propose three learning strategies to automatically adjust the tradeoff parameter for Banditron. Our extensive empirical study with multiple real-world data sets verifies the efficacy of the proposed strategies in learning the exploration vs. exploitation tradeoff parameter.
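
To make the role of the fixed tradeoff parameter concrete, the following is a minimal sketch of one Banditron round as it is usually described in the cited Banditron paper, not code from this paper. The names `banditron_step`, `run_banditron`, and `gamma` are illustrative assumptions, and the three adaptive strategies proposed here are not shown because the abstract does not specify them.

```python
import random


def banditron_step(W, x, y_true, gamma, rng):
    """One Banditron round with a fixed tradeoff gamma; mutates W in place."""
    k = len(W)
    scores = [sum(w * xj for w, xj in zip(row, x)) for row in W]
    y_hat = max(range(k), key=lambda r: scores[r])  # greedy (exploit) label
    # Exploit y_hat with weight 1 - gamma, explore uniformly with weight gamma.
    probs = [gamma / k + (1.0 - gamma) * (r == y_hat) for r in range(k)]
    y_pred = rng.choices(range(k), weights=probs)[0]
    correct = y_pred == y_true  # the only feedback the learner receives
    # Importance-weighted estimate of the multiclass Perceptron update.
    if correct:
        scale = 1.0 / probs[y_pred]
        W[y_pred] = [w + scale * xj for w, xj in zip(W[y_pred], x)]
    W[y_hat] = [w - xj for w, xj in zip(W[y_hat], x)]
    return y_pred, correct


def run_banditron(examples, k, d, gamma, seed=0):
    """Run over (x, y) pairs; returns the weights and the mistake count."""
    rng = random.Random(seed)
    W = [[0.0] * d for _ in range(k)]
    mistakes = 0
    for x, y in examples:
        _, correct = banditron_step(W, x, y, gamma, rng)
        mistakes += not correct
    return W, mistakes
```

In this sketch, a larger `gamma` samples non-greedy labels more often, which yields more informative feedback at the cost of more deliberate wrong predictions; that sensitivity is what motivates tuning the parameter automatically.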