Quantifying trends accurately despite classifier error and class imbalance

Authors:
George Forman
Affiliations:
Hewlett-Packard Labs, Palo Alto, CA
Venue:
Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2006

Abstract

This paper promotes a new task for supervised machine learning research: quantification, the pursuit of learning methods that accurately estimate the class distribution of a test set, with no concern for predictions on individual cases. A variant for cost quantification addresses the need to total up costs according to categories predicted by imperfect classifiers. These tasks cover a large and important family of applications that measure trends over time. The paper establishes a research methodology and uses it to evaluate several proposed methods that involve selecting the classification threshold in a way that would spoil the accuracy of individual classifications. In empirical tests, Median Sweep methods show outstanding ability to estimate the class distribution, despite wide disparity between testing and training conditions. The paper addresses shifting class priors and costs, but not concept drift in general.
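
The abstract does not spell out the estimators, so the following is only an illustrative sketch of the general idea, not the paper's implementation. It assumes the classifier's true positive rate (tpr) and false positive rate (fpr) at each threshold have already been estimated on training data, for example by cross-validation. The function names, the `min_gap` parameter, and its default value are assumptions introduced for this example.

```python
from statistics import median


def adjusted_count(observed_rate, tpr, fpr):
    """Correct the observed fraction of predicted positives for classifier error.

    Solves observed = tpr * p + fpr * (1 - p) for the true prevalence p,
    then clips the result to the valid range [0, 1].
    """
    if tpr <= fpr:
        raise ValueError("tpr must exceed fpr for the correction to be defined")
    estimate = (observed_rate - fpr) / (tpr - fpr)
    return min(1.0, max(0.0, estimate))


def median_sweep(test_scores, thresholds, tprs, fprs, min_gap=0.25):
    """Median of adjusted-count estimates taken over many thresholds.

    tprs[i] and fprs[i] are the rates estimated at thresholds[i].
    Thresholds where tpr - fpr is below min_gap are skipped because the
    correction is unstable there (min_gap is an assumed cut-off).
    """
    estimates = []
    for threshold, tpr, fpr in zip(thresholds, tprs, fprs):
        if tpr - fpr < min_gap:
            continue
        # Fraction of test items scored at or above this threshold.
        observed = sum(s >= threshold for s in test_scores) / len(test_scores)
        estimates.append(adjusted_count(observed, tpr, fpr))
    if not estimates:
        raise ValueError("no threshold has a usable tpr - fpr gap")
    return median(estimates)


if __name__ == "__main__":
    scores = [0.9, 0.8, 0.3, 0.2, 0.1]  # hypothetical classifier scores
    print(median_sweep(scores, [0.25, 0.5, 0.85], [1.0, 1.0, 0.5], [0.25, 0.0, 0.0]))
```

Note that the threshold giving the most stable count estimate need not be the one that gives the best individual predictions, which is the trade-off the abstract alludes to.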