This paper argues that severe class imbalance is not just an interesting technical challenge that improved learning algorithms will address; it is a much more serious problem. To be useful, a classifier must appreciably outperform a trivial solution, such as always predicting the majority class. Any application that is inherently noisy limits the error rate, and the cost, that can be achieved. When the data are normally distributed, even a Bayes-optimal classifier yields a vanishingly small reduction in the majority classifier's error rate, and cost, as the imbalance increases. For fat-tailed distributions, and when practical classifiers are used, often no reduction is achieved at all.
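The Gaussian claim can be illustrated with a small numerical sketch. The code below is not from the paper; it assumes two one-dimensional classes N(0,1) (majority) and N(d,1) (minority) with an illustrative separation d=2, computes the Bayes-optimal error analytically, and compares it with the majority classifier's error (which equals the minority prior). The relative reduction shrinks as the imbalance grows.

```python
# Illustrative sketch (not the paper's experiment): Bayes-optimal error
# vs. the majority classifier's error for two 1-D Gaussian classes.
import math

def norm_cdf(x):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bayes_error(p_minority, d=2.0):
    """Error of the Bayes-optimal rule for classes N(0,1) (majority)
    and N(d,1) (minority), with prior p_minority on the minority class."""
    p0, p1 = 1.0 - p_minority, p_minority
    # Likelihood-ratio decision threshold for equal-variance Gaussians.
    t = d / 2.0 + math.log(p0 / p1) / d
    # Misclassify majority as minority + minority as majority.
    return p0 * (1.0 - norm_cdf(t)) + p1 * norm_cdf(t - d)

for p1 in (0.5, 0.1, 0.01, 0.001):
    majority_error = p1  # always predicting the majority class errs on p1 of cases
    reduction = (majority_error - bayes_error(p1)) / majority_error
    print(f"minority prior={p1:6.3f}  relative error reduction={reduction:.1%}")
```

With these assumed parameters, the relative reduction falls from roughly two thirds at a balanced prior to well under one percent at a 1000:1 imbalance, matching the abstract's point that the Bayes-optimal classifier barely improves on the trivial baseline as imbalance becomes severe.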