Hybrid sampling for imbalanced data

Authors:
Chris Seiffert; Taghi M. Khoshgoftaar; Jason Van Hulse
Affiliations:
Data Mining and Machine Learning Laboratory, Department of Computer Science and Engineering, Florida Atlantic University, Boca Raton, FL, USA; (Corresponding author. Tel.: +1 561 297 3994; Fax: +1 561 297 2800; E-mail: taghi@cse.fau.edu) Data Mining and Machine Learning Laboratory, Department of Computer Science and Engineering, Florida Atlantic Uni ...; Data Mining and Machine Learning Laboratory, Department of Computer Science and Engineering, Florida Atlantic University, Boca Raton, FL, USA
Venue:
Integrated Computer-Aided Engineering - Selected papers from the IEEE Conference on Information Reuse and Integration (IRI), July 13-15, 2008
Year:
2009

Abstract

Building a classification model on imbalanced data sets can be challenging. Models built on data where examples of one class are greatly outnumbered by examples of the other class(es) tend to sacrifice accuracy on the underrepresented class in favor of maximizing the overall classification rate. Several methods have been suggested to alleviate the problem of class imbalance. One common technique that has received much attention in recent research is data sampling. Data sampling either adds examples to the minority class (oversampling) or removes examples from the majority class (undersampling) in order to create a more balanced data set. Both oversampling and undersampling have their strengths and drawbacks. In this work we propose a hybrid sampling procedure that uses a combination of two sampling techniques to create a balanced data set. By using more than one sampling technique, we can combine the strengths of the individual techniques while lessening their drawbacks. We perform a comprehensive set of experiments, building more than one million classifiers, and show that our hybrid sampling procedure almost always outperforms the individual sampling techniques.
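
To make the idea of combining two sampling techniques concrete, the following is a minimal Python sketch of one possible hybrid: random undersampling of the majority class followed by random oversampling of the minority class. The abstract does not specify which techniques the authors pair or how the work is split between them, so the function name `hybrid_sample`, the `undersample_fraction` parameter, and the choice of purely random sampling are all assumptions made for illustration, not the paper's procedure.

```python
import random
from collections import Counter


def hybrid_sample(X, y, minority_label, undersample_fraction=0.5, seed=0):
    """Balance a two-class data set with two sampling steps (illustrative only).

    undersample_fraction is a hypothetical knob: the share of the class-size
    gap closed by removing majority examples. The remainder of the gap is
    closed by duplicating randomly chosen minority examples.
    """
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(y) if label == minority_label]
    majority = [i for i, label in enumerate(y) if label != minority_label]
    if not minority or len(minority) >= len(majority):
        return list(X), list(y)  # nothing to balance

    gap = len(majority) - len(minority)

    # Step 1 (undersampling): drop part of the gap from the majority class.
    n_remove = int(gap * undersample_fraction)
    kept_majority = rng.sample(majority, len(majority) - n_remove)

    # Step 2 (oversampling): duplicate minority examples to close the rest.
    n_add = len(kept_majority) - len(minority)
    extra_minority = [rng.choice(minority) for _ in range(n_add)]

    chosen = kept_majority + minority + extra_minority
    rng.shuffle(chosen)
    return [X[i] for i in chosen], [y[i] for i in chosen]


if __name__ == "__main__":
    # Toy data: 10 minority examples (label 1) and 90 majority examples (label 0).
    X = [[i] for i in range(100)]
    y = [1] * 10 + [0] * 90
    X_bal, y_bal = hybrid_sample(X, y, minority_label=1)
    print(sorted(Counter(y_bal).items()))  # [(0, 50), (1, 50)]
```

In this sketch, setting `undersample_fraction` to 0 reduces to pure oversampling and setting it to 1 reduces to pure undersampling, which shows how a hybrid sits between the two: fewer majority examples are discarded than under undersampling alone, and fewer minority duplicates are created than under oversampling alone.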