Wrapper-based computation and evaluation of sampling methods for imbalanced datasets

  • Authors:
  • Nitesh V. Chawla;Lawrence O. Hall;Ajay Joshi

  • Affiliations:
  • University of Notre Dame, Notre Dame, IN;University of South Florida, Tampa, FL;University of South Florida, Tampa, FL

  • Venue:
  • UBDM '05 Proceedings of the 1st international workshop on Utility-based data mining
  • Year:
  • 2005

Abstract

Learning from imbalanced datasets presents an interesting problem from both the modeling and the economic standpoints. When the imbalance is large, classification accuracy on the smaller class(es) tends to be lower. In particular, when a class is of great interest but occurs relatively rarely, such as cases of fraud, instances of disease, and regions of interest in large-scale simulations, it is important to identify it accurately. It then becomes more costly to misclassify the interesting class. In this paper, we implement a wrapper approach that computes the amount of under-sampling and synthetic generation of the minority class examples (SMOTE) to improve minority class accuracy. The f-value serves as the evaluation function. Experimental results show that the wrapper approach is effective at optimizing the composite f-value, and that it reduces the average cost per test example for the datasets considered. We report both average cost per test example and the cost curves in the paper. The true positive rate of the minority class increases significantly without causing a significant change in the f-value. We also obtain the lowest cost per test example, compared to any result we are aware of, for the KDD Cup-99 intrusion detection data set.
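The quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not describe the search procedure, so the exhaustive grid search here is an assumption, as are the function names, the balanced f-value (beta = 1), and the misclassification costs. The `evaluate` callback is hypothetical; in practice it would resample the training data at the given under-sampling and SMOTE levels, train a classifier, and return the f-value on held-out data.

```python
def f_value(tp, fp, fn, beta=1.0):
    """F-measure of the minority (positive) class from confusion-matrix counts.

    beta = 1 weights precision and recall equally (an assumption here).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def average_cost(fp, fn, n_test, cost_fp=1.0, cost_fn=1.0):
    """Average misclassification cost per test example.

    cost_fp and cost_fn are illustrative values, not taken from the paper.
    """
    return (fp * cost_fp + fn * cost_fn) / n_test


def wrapper_search(evaluate, under_levels, smote_levels):
    """Try every (under-sampling, SMOTE) pair and keep the best-scoring one.

    evaluate(u, s) is a hypothetical callback returning the score to
    maximize. Returns a (score, u, s) tuple for the best pair found.
    """
    best = None
    for u in under_levels:
        for s in smote_levels:
            score = evaluate(u, s)
            if best is None or score > best[0]:
                best = (score, u, s)
    return best
```

For example, `average_cost(fp=2, fn=2, n_test=100, cost_fn=5.0)` charges each missed minority example five times as much as a false alarm, which is the kind of asymmetry that makes the minority class true positive rate matter more than overall accuracy.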