Using asymmetric distributions to improve text classifier probability estimates

Authors:
Paul N. Bennett
Affiliations:
Carnegie Mellon University, Pittsburgh, PA
Venue:
Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval
Year:
2003

Citing 12
Cited 15

A sequential algorithm for training text classifiers

SIGIR '94 Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval
A sequential algorithm for training text classifiers: corrigendum and additional data

ACM SIGIR Forum
Training algorithms for linear text classifiers

SIGIR '96 Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval
Inductive learning algorithms and representations for text categorization

Proceedings of the seventh international conference on Information and knowledge management
A re-examination of text categorization methods

Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
Large Margin Classification Using the Perceptron Algorithm

Machine Learning - The Eleventh Annual Conference on computational Learning Theory
Hierarchical classification of Web content

SIGIR '00 Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval
Modeling score distributions for combining the outputs of search engines

Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
Text Categorization with Suport Vector Machines: Learning with Many Relevant Features

ECML '98 Proceedings of the 10th European Conference on Machine Learning
Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers

ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning
Pattern Classification (2nd Edition)

Pattern Classification (2nd Edition)
Active Sampling for Class Probability Estimation and Ranking

Machine Learning

Probabilistic score estimation with piecewise logistic regression

ICML '04 Proceedings of the twenty-first international conference on Machine learning
An analysis of the relative hardness of Reuters-21578 subsets: Research Articles

Journal of the American Society for Information Science and Technology
Local sparsity control for naive Bayes with extreme misclassification costs

Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining
ROC confidence bands: an empirical evaluation

ICML '05 Proceedings of the 22nd international conference on Machine learning
Generating query substitutions

Proceedings of the 15th international conference on World Wide Web
Extreme value theory applied to document retrieval from large collections

Information Retrieval
Partitioned logistic regression for spam filtering

Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
Neighborhood-Based Local Sensitivity

ECML '07 Proceedings of the 18th European conference on Machine Learning
Modeling the Score Distributions of Relevant and Non-relevant Documents

ICTIR '09 Proceedings of the 2nd International Conference on Theory of Information Retrieval: Advances in Information Retrieval Theory
Score distribution models: assumptions, intuition, and robustness to score manipulation

Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
Online stratified sampling: evaluating classifiers at web-scale

CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
Variational bayes for modeling score distributions

Information Retrieval
Smooth receiver operating characteristics (smROC) curves

ECML PKDD'11 Proceedings of the 2011 European conference on Machine learning and knowledge discovery in databases - Volume Part II
A term weighting approach for text categorization

AIRS'05 Proceedings of the Second Asia conference on Asia Information Retrieval Technology
Counting positives accurately despite inaccurate classification

ECML'05 Proceedings of the 16th European conference on Machine Learning

Quantified Score

Hi-index	0.00

Visualization

Abstract

Text classifiers that give probability estimates are more readily applicable in a variety of scenarios. For example, rather than choosing one set decision threshold, they can be used in a Bayesian risk model to issue a run-time decision which minimizes a user-specified cost function dynamically chosen at prediction time. However, the quality of the probability estimates is crucial. We review a variety of standard approaches to converting scores (and poor probability estimates) from text classifiers to high quality estimates and introduce new models motivated by the intuition that the empirical score distribution for the "extremely irrelevant", "hard to discriminate", and "obviously relevant" items are often significantly different. Finally, we analyze the experimental performance of these models over the outputs of two text classifiers. The analysis demonstrates that one of these models is theoretically attractive (introducing few new parameters while increasing flexibility), computationally efficient, and empirically preferable.