This paper analyses alternative techniques for deploying low-cost human resources to acquire training data for classifier induction in domains exhibiting extreme class imbalance, where traditional labeling strategies, such as active learning, can be ineffective. Consider the problem of building classifiers to help brands control the content adjacent to their on-line advertisements. Although frequent enough to worry advertisers, objectionable categories are rare in the distribution of impressions encountered by most on-line advertisers; they are so rare that traditional sampling techniques do not find enough positive examples to train effective models. An alternative way to deploy human resources for training-data acquisition is to have them "guide" the learning by searching explicitly for training examples of each class. We show that under extreme skew, even basic techniques for guided learning completely dominate smart (active) strategies for applying human resources to select cases for labeling. It is therefore critical to consider the relative cost of search versus labeling, and we demonstrate the tradeoffs for different relative costs. We show that in cost/skew settings where the choice between search and active labeling is equivocal, a hybrid strategy can combine the benefits of both.
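The search-versus-labeling tradeoff can be illustrated with a back-of-the-envelope calculation. The sketch below is not the paper's cost model: it assumes, purely for illustration, that labeling draws examples uniformly at random (so one positive turns up every 1/p labels on average, where p is the positive rate) and that a guided search returns exactly one positive example per unit of search cost. It does not model active learning or the search for negative examples. All function and parameter names are hypothetical.

```python
def cost_per_positive_random(label_cost: float, positive_rate: float) -> float:
    """Expected labeling spend to obtain one positive by uniform random sampling.

    Assumption: draws are independent, so the expected number of labels
    needed to see one positive is 1 / positive_rate.
    """
    if not 0.0 < positive_rate <= 1.0:
        raise ValueError("positive_rate must be in (0, 1]")
    return label_cost / positive_rate


def cheaper_strategy(label_cost: float, search_cost: float, positive_rate: float) -> str:
    """Compare the cost of one positive example under the two simplified strategies.

    Assumption: one guided search costs `search_cost` and yields one positive.
    """
    random_cost = cost_per_positive_random(label_cost, positive_rate)
    if search_cost < random_cost:
        return "search"
    if search_cost > random_cost:
        return "label"
    return "tie"


if __name__ == "__main__":
    # A search that costs 20 labels' worth of effort, at several positive rates.
    for rate in (0.5, 0.1, 0.01, 0.001):
        print(rate, cheaper_strategy(label_cost=1.0, search_cost=20.0, positive_rate=rate))
```

Under these simplified assumptions the break-even point is a search cost of `label_cost / positive_rate`, so the rarer the positive class, the more expensive a search can be while still costing less per positive than random labeling.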