Hybrid sampling for imbalanced data

  • Authors:
  • Chris Seiffert; Taghi M. Khoshgoftaar; Jason Van Hulse

  • Affiliations:
  • Data Mining and Machine Learning Laboratory, Department of Computer Science and Engineering, Florida Atlantic University, Boca Raton, FL, USA;(Correspd. Tel.: +1 561 297 3994/ Fax: +1 561 297 2800/ E-mail: taghi@cse.fau.edu) Data Mining and Machine Learning Laboratory, Department of Computer Science and Engineering, Florida Atlantic Uni ...;Data Mining and Machine Learning Laboratory, Department of Computer Science and Engineering, Florida Atlantic University, Boca Raton, FL, USA

  • Venue:
  • Integrated Computer-Aided Engineering - Selected papers from the IEEE Conference on Information Reuse and Integration (IRI), July 13-15, 2008
  • Year:
  • 2009

Abstract

Building a classification model on imbalanced data sets can be a challenging endeavor. Models built on data where examples of one class are greatly outnumbered by examples of the other class(es) tend to sacrifice accuracy with respect to the underrepresented class in favor of maximizing the overall classification rate. Several methods have been suggested to alleviate the problem of class imbalance. One common technique that has received much attention in recent research is data sampling. Data sampling either adds examples to the minority class (oversampling) or removes examples from the majority class (undersampling) in order to create a more balanced data set. Both oversampling and undersampling have their strengths and drawbacks. In this work we propose a hybrid sampling procedure that uses a combination of two sampling techniques to create a balanced data set. By using more than one sampling technique, we can combine the strengths of the individual techniques while lessening the drawbacks. We perform a comprehensive set of experiments, with more than one million classifiers built, showing that our hybrid sampling procedure almost always outperforms the individual sampling techniques.
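
The abstract does not say which two sampling techniques are combined or how the work is divided between them. As a rough illustration of the general idea only, the sketch below pairs random undersampling with random oversampling: part of the class-size gap is closed by discarding majority examples, and the rest by duplicating minority examples. The function name `hybrid_sample` and the parameters `undersample_share` and `seed` are hypothetical and are not taken from the paper.

```python
import random


def hybrid_sample(majority, minority, undersample_share=0.5, seed=0):
    """Balance two classes with random undersampling plus random oversampling.

    Illustrative sketch only, not the authors' procedure. `undersample_share`
    (an assumed parameter) is the fraction of the class-size gap closed by
    removing majority examples; the remainder is closed by duplicating
    minority examples.
    """
    if not minority:
        raise ValueError("minority class must contain at least one example")
    if not 0.0 <= undersample_share <= 1.0:
        raise ValueError("undersample_share must lie in [0, 1]")

    rng = random.Random(seed)
    majority = list(majority)
    minority = list(minority)

    # Size difference that sampling has to eliminate.
    gap = max(len(majority) - len(minority), 0)

    # Undersampling step: keep a random subset of the majority class.
    n_remove = int(gap * undersample_share)
    kept_majority = rng.sample(majority, len(majority) - n_remove)

    # Oversampling step: draw minority examples with replacement until
    # the minority class matches the reduced majority class.
    n_add = len(kept_majority) - len(minority)
    extra = [rng.choice(minority) for _ in range(n_add)]

    return kept_majority, minority + extra


if __name__ == "__main__":
    majority = list(range(90))        # 90 majority examples
    minority = list(range(100, 110))  # 10 minority examples
    kept, grown = hybrid_sample(majority, minority)
    print(len(kept), len(grown))
```

With `undersample_share=0.5`, both classes end at the midpoint between their original sizes. Setting it to 1.0 gives pure undersampling and 0.0 gives pure oversampling, so the two individual techniques are the endpoints of this sketch.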