Why label when you can search?: alternatives to active learning for applying human resources to build classification models under extreme class imbalance

Authors:
Josh Attenberg;Foster Provost
Affiliations:
Polytechnic Institute of NYU, Brooklyn, NY, USA;NYU Stern School of Business, New York, NY, USA
Venue:
Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2010

Citing 19
Cited 10

Query by committee

COLT '92 Proceedings of the fifth annual workshop on Computational learning theory
A sequential algorithm for training text classifiers

SIGIR '94 Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval
Efficient progressive sampling

KDD '99 Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining
MetaCost: a general method for making classifiers cost-sensitive

KDD '99 Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining
Support vector machine active learning with applications to text classification

The Journal of Machine Learning Research
Active learning using pre-clustering

ICML '04 Proceedings of the twenty-first international conference on Machine learning
Learning on the border: active learning in imbalanced data classification

Proceedings of the sixteenth ACM conference on Conference on information and knowledge management
Get another label? improving data quality and data mining using multiple, noisy labelers

Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
Dual Strategy Active Learning

ECML '07 Proceedings of the 18th European conference on Machine Learning
Active Class Selection

ECML '07 Proceedings of the 18th European conference on Machine Learning
Feature hashing for large scale multitask learning

ICML '09 Proceedings of the 26th Annual International Conference on Machine Learning
Beyond blacklists: learning to detect malicious web sites from suspicious URLs

Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
Reducing class imbalance during active learning for named entity annotation

Proceedings of the fifth international conference on Knowledge capture
Active learning with sampling by uncertainty and density for word sense disambiguation and text classification

COLING '08 Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1
Taking into account the differences between actively and passively acquired data: the case of active learning with support vector machines for imbalanced datasets

NAACL-Short '09 Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers
SMOTE: synthetic minority over-sampling technique

Journal of Artificial Intelligence Research
Learning when training data are costly: the effect of class distribution on tree induction

Journal of Artificial Intelligence Research
Exploratory undersampling for class-imbalance learning

IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics
A large-scale active learning system for topical categorization on the web

Proceedings of the 19th international conference on World wide web

Batch query processing for web search engines

Proceedings of the fourth ACM international conference on Web search and data mining
Inactive learning?: difficulties employing active learning in practice

ACM SIGKDD Explorations Newsletter
Online active inference and learning

Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
Detecting adversarial advertisements in the wild

Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
Concept labeling: building text classifiers with minimal supervision

IJCAI'11 Proceedings of the Twenty-Second international joint conference on Artificial Intelligence - Volume Volume Two
Detecting hate speech on the world wide web

LSM '12 Proceedings of the Second Workshop on Language in Social Media
Active learning for imbalanced sentiment classification

EMNLP-CoNLL '12 Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning
Map to humans and reduce error: crowdsourcing for deduplication applied to digital libraries

Proceedings of the 21st ACM international conference on Information and knowledge management
Machine learning for targeted display advertising: transfer learning in action

Machine Learning
Explaining data-driven document classifications

MIS Quarterly

Quantified Score

Hi-index	0.00

Visualization

Abstract

This paper analyses alternative techniques for deploying low-cost human resources for data acquisition for classifier induction in domains exhibiting extreme class imbalance - where traditional labeling strategies, such as active learning, can be ineffective. Consider the problem of building classifiers to help brands control the content adjacent to their on-line advertisements. Although frequent enough to worry advertisers, objectionable categories are rare in the distribution of impressions encountered by most on-line advertisers - so rare that traditional sampling techniques do not find enough positive examples to train effective models. An alternative way to deploy human resources for training-data acquisition is to have them "guide" the learning by searching explicitly for training examples of each class. We show that under extreme skew, even basic techniques for guided learning completely dominate smart (active) strategies for applying human resources to select cases for labeling. Therefore, it is critical to consider the relative cost of search versus labeling, and we demonstrate the tradeoffs for different relative costs. We show that in cost/skew settings where the choice between search and active labeling is equivocal, a hybrid strategy can combine the benefits.