Counting positives accurately despite inaccurate classification

Authors:
George Forman
Affiliations:
Hewlett-Packard Labs, Palo Alto, CA
Venue:
ECML'05 Proceedings of the 16th European conference on Machine Learning
Year:
2005

Citing 4
Cited 14

Data mining: practical machine learning tools and techniques with Java implementations
Using asymmetric distributions to improve text classifier probability estimates

Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval
An extensive empirical study of feature selection metrics for text classification

The Journal of Machine Learning Research
Learning when training data are costly: the effect of class distribution on tree induction

Journal of Artificial Intelligence Research

Tackling concept drift by temporal inductive transfer

SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Quantifying trends accurately despite classifier error and class imbalance

Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
Pragmatic text mining: minimizing human effort to quantify many issues in call logs

Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
Bootstrap FDA for counting positives accurately in imprecise environments

Pattern Recognition
Quantifying counts and costs via classification

Data Mining and Knowledge Discovery
Classification and Quantification Based on Image Analysis for Sperm Samples with Uncertain Damaged/Intact Cell Proportions

ICIAR '08 Proceedings of the 5th international conference on Image Analysis and Recognition
Quantifying the proportion of damaged sperm cells based on image analysis and neural networks

SMO'08 Proceedings of the 8th conference on Simulation, modelling and optimization
Quantification and semi-supervised classification methods for handling changes in class distribution

Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
Joint cutoff probabilistic estimation using simulation: a mailing campaign application

IDEAL'07 Proceedings of the 8th international conference on Intelligent data engineering and automated learning
Network quantification despite biased labels

Proceedings of the Eighth Workshop on Mining and Learning with Graphs
Smooth receiver operating characteristics (smROC) curves

ECML PKDD'11 Proceedings of the 2011 European conference on Machine learning and knowledge discovery in databases - Volume Part II
Handling concept drift via ensemble and class distribution estimation technique

ADMA'11 Proceedings of the 7th international conference on Advanced Data Mining and Applications - Volume Part II
Class distribution estimation based on the Hellinger distance

Information Sciences: an International Journal
Aggregative quantification for regression

Data Mining and Knowledge Discovery

Abstract

Most supervised machine learning research assumes that the training set is a random sample from the target population, and thus that the class distribution is invariant. In real-world situations, however, the class distribution changes, and this shift is known to erode the effectiveness of classifiers and calibrated probability estimators. This paper focuses on quantification: the problem of accurately estimating the number of positives in the test set, as opposed to classifying individual cases accurately. It compares three methods: classify & count, an adjusted variant, and a mixture model. An empirical evaluation on a text classification benchmark reveals that the simple method (classify & count) is consistently biased, and that the mixture model is surprisingly effective even when positives are very scarce in the training set, a common case in information retrieval.
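To make the first two methods concrete, here is a minimal Python sketch. It is an illustration only and not taken from the abstract, which gives no formulas. It assumes the common form of the adjusted variant, in which the observed fraction of predicted positives is corrected using the classifier's true positive rate and false positive rate. The function names, and the `tpr` and `fpr` parameters (which would have to be estimated separately, for instance by cross-validation on the training set), are hypothetical.

```python
def classify_and_count(predictions):
    """Estimate the positive fraction as the share of items predicted positive.

    `predictions` is a sequence of 0/1 classifier outputs on the test set.
    """
    return sum(predictions) / len(predictions)


def adjusted_count(predictions, tpr, fpr):
    """Correct the classify-and-count estimate for known classifier error.

    Assumption: tpr and fpr are the classifier's true and false positive
    rates, estimated elsewhere. If p is the true positive fraction, the
    expected observed fraction is p * tpr + (1 - p) * fpr; solving for p
    gives the correction below.
    """
    observed = classify_and_count(predictions)
    if tpr <= fpr:
        # The correction is undefined or unstable; fall back to the raw count.
        return observed
    estimate = (observed - fpr) / (tpr - fpr)
    # Clip to the valid range, since noisy rate estimates can overshoot.
    return min(1.0, max(0.0, estimate))


# Illustrative numbers only: 1000 test items, of which 240 are predicted positive.
predictions = [1] * 240 + [0] * 760
raw = classify_and_count(predictions)
corrected = adjusted_count(predictions, tpr=0.8, fpr=0.1)
print(f"classify & count: {raw:.3f}, adjusted: {corrected:.3f}")
```

In this made-up example, a test set with 20% true positives scored by a classifier with a true positive rate of 0.8 and a false positive rate of 0.1 would be expected to yield 24% predicted positives, so the raw count overestimates and the adjustment recovers roughly 0.2. The mixture model is not sketched here because the abstract does not describe its construction.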