An empirical comparison of repetitive undersampling techniques

Authors:
Jason Van Hulse;Taghi M. Khoshgoftaar;Amri Napolitano
Affiliations:
Department of Computer Science and Engineering, Florida Atlantic University, Boca Raton, FL;Department of Computer Science and Engineering, Florida Atlantic University, Boca Raton, FL;Department of Computer Science and Engineering, Florida Atlantic University, Boca Raton, FL
Venue:
IRI'09 Proceedings of the 10th IEEE international conference on Information Reuse & Integration
Year:
2009



Abstract

A common problem for data mining and machine learning practitioners is class imbalance. When examples of one class greatly outnumber examples of the other class(es), traditional machine learning algorithms can perform poorly. Random undersampling is a technique that has shown great potential for alleviating the problem of class imbalance. However, undersampling leads to information loss, which can hinder classification performance in some cases. To overcome this problem, repetitive undersampling techniques have been proposed. These techniques generate an ensemble of models, each trained on a different, undersampled subset of the training data. In doing so, less information is lost and classification performance is improved. In this study, we evaluate the performance of several repetitive undersampling techniques. To our knowledge, no study has so thoroughly compared repetitive undersampling techniques.
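
The general idea described in the abstract can be sketched as follows. This is a minimal illustration and not the specific techniques the paper evaluates: the abstract does not name the learners or sampling variants, so the nearest-centroid base learner, the majority-vote aggregation, the 0/1 label encoding, and all function names here are assumptions chosen to keep the example self-contained.

```python
import random
from statistics import mean


def undersample(X, y, rng):
    """Keep every minority example plus an equal-sized random draw from the majority."""
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    keep = minority + rng.sample(majority, len(minority))
    return [X[i] for i in keep], [y[i] for i in keep]


def fit_centroids(X, y):
    """Toy base learner (an assumption, not from the paper): one centroid per class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids


def predict_one(centroids, x):
    """Return the label whose centroid is closest to x (squared Euclidean distance)."""
    def sqdist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: sqdist(centroids[label]))


def fit_ensemble(X, y, n_models=10, seed=0):
    """Train one model per independently undersampled subset of the training data."""
    rng = random.Random(seed)
    return [fit_centroids(*undersample(X, y, rng)) for _ in range(n_models)]


def predict_ensemble(models, x):
    """Majority vote over the ensemble; ties go to class 0."""
    votes = sum(predict_one(m, x) for m in models)
    return 1 if 2 * votes > len(models) else 0
```

Because each model draws a different random subset of the majority class, the ensemble as a whole sees more of the majority examples than any single undersampled model would, which is the motivation the abstract gives for repeating the undersampling step.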