Using classifier cascades for scalable e-mail classification

Authors:
Jay Pujara;Hal Daumé, III;Lise Getoor
Affiliations:
University of Maryland, College Park, MD;University of Maryland, College Park, MD;University of Maryland, College Park, MD
Venue:
Proceedings of the 8th Annual Collaboration, Electronic messaging, Anti-Abuse and Spam Conference
Year:
2011

Citing 5
Cited 2

Hierarchically Classifying Documents Using Very Few Words

ICML '97 Proceedings of the Fourteenth International Conference on Machine Learning
Test-Cost Sensitive Naive Bayes Classification

ICDM '04 Proceedings of the Fourth IEEE International Conference on Data Mining
Active Feature-Value Acquisition for Classifier Induction

ICDM '04 Proceedings of the Fourth IEEE International Conference on Data Mining
Cost-sensitive classification: empirical evaluation of a hybrid genetic decision tree induction algorithm

Journal of Artificial Intelligence Research
Designing efficient cascaded classifiers: tradeoff between accuracy and cost

Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining

Besting the quiz master: crowdsourcing incremental classification games

EMNLP-CoNLL '12 Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning
A large-scale empirical analysis of email spam detection through network characteristics in a stand-alone enterprise

Computer Networks: The International Journal of Computer and Telecommunications Networking

Quantified Score

Hi-index	0.00

Visualization

Abstract

In many real-world scenarios, we must make judgments in the presence of computational constraints. One common computational constraint arises when the features used to make a judgment each have differing acquisition costs, but there is a fixed total budget for a set of judgments. Particularly when there are a large number of classifications that must be made in a real-time, an intelligent strategy for optimizing accuracy versus computational costs is essential. E-mail classification is an area where accurate and timely results require such a trade-off. We identify two scenarios where intelligent feature acquisition can improve classifier performance. In granular classification we seek to classify e-mails with increasingly specific labels structured in a hierarchy, where each level of the hierarchy requires a different trade-off between cost and accuracy. In load-sensitive classification, we classify a set of instances within an arbitrary total budget for acquiring features. Our method, Adaptive Classifier Cascades (ACC), designs a policy to combine a series of base classifiers with increasing computational costs given a desired trade-off between cost and accuracy. Using this method, we learn a relationship between feature costs and label hierarchies, for granular classification and cost budgets, for load-sensitive classification. We evaluate our method on real-world e-mail datasets with realistic estimates of feature acquisition cost, and we demonstrate superior results when compared to baseline classifiers that do not have a granular, cost-sensitive feature acquisition policy.