A Comparative Study of Data Sampling and Cost Sensitive Learning

  • Authors:
  • Chris Seiffert; Taghi M. Khoshgoftaar; Jason Van Hulse; Amri Napolitano

  • Venue:
  • ICDMW '08 Proceedings of the 2008 IEEE International Conference on Data Mining Workshops
  • Year:
  • 2008

Abstract

Two common challenges that data mining and machine learning practitioners face in many application domains are unequal classification costs and class imbalance. Most traditional data mining techniques attempt to maximize overall accuracy rather than minimize cost. When data is imbalanced, such techniques produce models that strongly favor the overrepresented class, which typically carries the lower cost of misclassification. Two techniques that have been used to address both of these issues are cost-sensitive learning and data sampling. In this work, we investigate the performance of two cost-sensitive learning techniques and four data sampling techniques for minimizing classification costs when data is imbalanced. We present a comprehensive suite of experiments, utilizing 15 datasets with 10 cost ratios, which have been carefully designed to ensure conclusive, significant, and reliable results.
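
To make the two families of techniques concrete, here is a minimal Python sketch. It is illustrative only: the abstract does not name the specific cost-sensitive or sampling methods the authors evaluated, so the functions below (`total_cost`, `cost_sensitive_threshold`, `random_undersample`) are hypothetical helpers, and random undersampling is used simply as one well-known example of data sampling. The sketch assumes binary labels where 1 is the minority (high-cost) class, a false positive costs 1, and a false negative costs `cost_ratio`.

```python
import random


def total_cost(y_true, y_pred, cost_ratio):
    """Total misclassification cost (assumed model: FN costs `cost_ratio`, FP costs 1)."""
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return cost_ratio * fn + fp


def cost_sensitive_threshold(cost_ratio):
    """Probability threshold that minimizes expected cost.

    Predicting positive is cheaper when p * cost_ratio >= (1 - p) * 1,
    i.e. when p >= 1 / (1 + cost_ratio).
    """
    return 1.0 / (1.0 + cost_ratio)


def random_undersample(X, y, seed=0):
    """Randomly drop majority-class (label 0) examples until the classes are equal in size."""
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(y) if label == 1]
    majority = [i for i, label in enumerate(y) if label == 0]
    kept = rng.sample(majority, min(len(minority), len(majority)))
    idx = sorted(minority + kept)
    return [X[i] for i in idx], [y[i] for i in idx]


if __name__ == "__main__":
    # Hypothetical toy data: 8 majority examples, 2 minority examples.
    y_true = [0] * 8 + [1] * 2
    always_majority = [0] * 10  # an accuracy-maximizing but cost-blind prediction
    print(total_cost(y_true, always_majority, cost_ratio=5))
    print(cost_sensitive_threshold(cost_ratio=5))
    X_bal, y_bal = random_undersample(list(range(10)), y_true)
    print(len(y_bal), sum(y_bal))
```

A model that always predicts the majority class is 80% accurate on this toy data yet pays the full false-negative cost for every minority example, which is the mismatch between accuracy and cost that the abstract describes.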