Quantifying counts and costs via classification

Authors: George Forman
Affiliations: Hewlett-Packard Labs, Palo Alto, USA
Venue: Data Mining and Knowledge Discovery
Year: 2008

Abstract

Many business applications track changes over time, for example, measuring the monthly prevalence of influenza incidents. In situations where a classifier is needed to identify the relevant incidents, imperfect classification accuracy can cause substantial bias in estimating class prevalence. The paper defines two research challenges for machine learning. The 'quantification' task is to accurately estimate the number of positive cases (or the class distribution) in a test set, using a training set that may have a substantially different distribution. The 'cost quantification' variant estimates the total cost associated with the positive class, where each case is tagged with a cost attribute, such as the expense to resolve the case. Quantification has a very different utility model from traditional classification research. For both forms of quantification, the paper describes a variety of methods and evaluates them with a suitable methodology, revealing which methods give reliable estimates when training data is scarce, the test class distribution differs widely from the training distribution, and the positive class is rare, e.g., 1% positives. These strengths can make quantification practical for business use, even where classification accuracy is poor.
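
The bias described in the abstract can be illustrated with a small sketch. The abstract does not say which methods the paper evaluates, so the code below is only an assumption-laden illustration: it contrasts naive counting of predicted positives with one well-known correction, often called adjusted count, which rescales the observed positive rate using the classifier's true positive rate and false positive rate. The function names and the example rates (1% prevalence, tpr 0.80, fpr 0.05) are hypothetical and chosen only to make the arithmetic visible.

```python
def classify_and_count(predictions):
    """Naive estimate: fraction of cases the classifier labels positive."""
    return sum(predictions) / len(predictions)


def adjusted_count(observed_rate, tpr, fpr):
    """Correct the observed positive rate using estimated tpr and fpr.

    Solves observed = p * tpr + (1 - p) * fpr for p and clips to [0, 1].
    In practice tpr and fpr would themselves be estimates (for example
    from cross-validation on the training set); here they are given.
    """
    if tpr <= fpr:
        raise ValueError("tpr must exceed fpr for the correction to apply")
    estimate = (observed_rate - fpr) / (tpr - fpr)
    return min(1.0, max(0.0, estimate))


# Hypothetical test set: 10,000 cases, 100 true positives (1% prevalence).
# Assumed classifier behavior: tpr = 0.80, fpr = 0.05.
predictions = (
    [1] * 80      # true positives (80% of 100 positives)
    + [0] * 20    # false negatives
    + [1] * 495   # false positives (5% of 9,900 negatives)
    + [0] * 9405  # true negatives
)

naive = classify_and_count(predictions)
corrected = adjusted_count(naive, tpr=0.80, fpr=0.05)

print(f"naive estimate:     {naive:.4f}")      # far above the true 1%
print(f"corrected estimate: {corrected:.4f}")  # close to the true 1%
```

With a rare positive class, even a modest false positive rate dominates the naive count, which is why the naive estimate here is several times the true prevalence. The correction recovers the true value only because the assumed tpr and fpr match the classifier's behavior on the test set exactly; with estimated rates the result would be approximate.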