Raising the baseline for high-precision text classifiers

Authors: Aleksander Kolcz; Wen-tau Yih
Affiliations: Microsoft; Microsoft
Venue: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Year: 2007

Abstract

Many important application areas of text classifiers demand high precision, and it is common to compare prospective solutions to the performance of Naive Bayes. This baseline is usually easy to improve upon, but in this work we demonstrate that appropriate document representation can make outperforming this classifier much more challenging. Most importantly, we provide a link between Naive Bayes and the logarithmic opinion pooling of the mixture-of-experts framework, which dictates a particular type of document length normalization. Motivated by document-specific feature selection, we propose monotonic constraints on document term weighting, which we show to be an effective method of fine-tuning document representation. The discussion is supported by experiments using three large email corpora corresponding to the problem of spam detection, where high precision is of particular importance.
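
The link between Naive Bayes and logarithmic opinion pooling mentioned in the abstract can be illustrated with a small sketch. It is not the authors' implementation. It assumes one plausible reading: each term occurrence acts as an expert contributing its log-odds, and a logarithmic opinion pool with equal weights that sum to one divides the summed log-odds by the document length. The function names `train_term_log_odds` and `score`, the Laplace smoothing parameter `alpha`, and the toy data in the usage note are all hypothetical.

```python
import math
from collections import Counter


def train_term_log_odds(docs, labels, alpha=1.0):
    """Return per-term log-odds log P(t|1) - log P(t|0) (multinomial model).

    docs   : list of token lists
    labels : list of 0/1 class labels (1 = spam, an assumption of this sketch)
    alpha  : Laplace smoothing constant (hypothetical default)
    """
    counts = {0: Counter(), 1: Counter()}
    for tokens, y in zip(docs, labels):
        counts[y].update(tokens)
    vocab = set(counts[0]) | set(counts[1])
    # Smoothed denominators, one per class.
    totals = {c: sum(counts[c].values()) + alpha * len(vocab) for c in (0, 1)}
    return {
        t: math.log((counts[1][t] + alpha) / totals[1])
        - math.log((counts[0][t] + alpha) / totals[0])
        for t in vocab
    }


def score(tokens, log_odds, prior_log_odds=0.0, normalize=True):
    """Log-odds score of a document.

    normalize=False: plain Naive Bayes, a sum over term occurrences.
    normalize=True : equal-weight logarithmic opinion pool, i.e. the same
                     sum scaled by 1 / (number of known term occurrences).
    """
    known = [t for t in tokens if t in log_odds]
    if not known:
        return prior_log_odds
    total = sum(log_odds[t] for t in known)
    weight = 1.0 / len(known) if normalize else 1.0
    return prior_log_odds + weight * total
```

For example, after `w = train_term_log_odds([["buy", "cheap", "pills"], ["meeting", "at", "noon"]], [1, 0])`, the unnormalized score of a document grows linearly as a spam-indicative token is repeated, while the normalized score stays at the average per-term log-odds. Under this sketch's assumptions, that is why a single decision threshold behaves more consistently across short and long messages.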