Wrapper-based computation and evaluation of sampling methods for imbalanced datasets

Authors:
Nitesh V. Chawla;Lawrence O. Hall;Ajay Joshi
Affiliations:
University of Notre Dame, Notre Dame, IN;University of South Florida, Tampa, FL;University of South Florida, Tampa, FL
Venue:
UBDM '05 Proceedings of the 1st international workshop on Utility-based data mining
Year:
2005

Abstract

Learning from imbalanced datasets presents an interesting problem from both modeling and economic standpoints. When the imbalance is large, classification accuracy on the smaller class(es) tends to be lower. When a class is of great interest but occurs relatively rarely, as with cases of fraud, instances of disease, and regions of interest in large-scale simulations, it is important to identify it accurately, and misclassifying the interesting class becomes more costly. In this paper, we implement a wrapper approach that computes the amounts of under-sampling and of synthetic minority class example generation (SMOTE) needed to improve minority class accuracy. The f-value serves as the evaluation function. Experimental results show that the wrapper approach is effective in optimizing the composite f-value and that it reduces the average cost per test example for the datasets considered. We report both the average cost per test example and the cost curves. The true positive rate of the minority class increases significantly without causing a significant change in the f-value. We also obtain the lowest cost per test example, compared to any result we are aware of, for the KDD Cup-99 intrusion detection data set.
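
The two quantities the abstract relies on, the f-value and the average cost per test example, can be illustrated with a small sketch. It assumes the f-value takes its usual form as the weighted harmonic mean of precision and recall (with beta = 1), and the confusion counts and cost matrix are hypothetical values chosen for illustration, not figures from the paper. The function names are likewise illustrative.

```python
def f_value(tp, fp, fn, beta=1.0):
    """F-measure of the minority (positive) class from confusion counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # recall is the minority true positive rate
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def average_cost(confusion, cost):
    """Average misclassification cost per test example.

    confusion[i][j]: number of test examples of true class i predicted as j.
    cost[i][j]: cost of predicting class j when the true class is i.
    """
    total = sum(sum(row) for row in confusion)
    weighted = sum(
        confusion[i][j] * cost[i][j]
        for i in range(len(confusion))
        for j in range(len(confusion[i]))
    )
    return weighted / total


# Hypothetical two-class test results: index 0 = majority, 1 = minority.
confusion = [[900, 50], [20, 30]]
# Assumed costs: missing a minority example is five times as costly
# as a false alarm on a majority example.
cost = [[0, 1], [5, 0]]

tp, fp, fn = confusion[1][1], confusion[0][1], confusion[1][0]
print(f"f-value: {f_value(tp, fp, fn):.3f}")
print(f"average cost per test example: {average_cost(confusion, cost):.3f}")
```

A wrapper of the kind described would evaluate a score such as `f_value` on held-out data for each candidate sampling level and keep the level that scores best. The cost matrix plays no role in that search under this reading and is used only to report the final cost.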