Building a classification model on imbalanced datasets can be a challenging endeavor. Models built on data where examples of one class greatly outnumber examples of the other class(es) tend to sacrifice accuracy on the underrepresented class in favor of maximizing the overall classification rate. Several methods have been suggested to alleviate the problem of class imbalance. One common technique that has received much attention in recent research is data sampling. Data sampling either adds examples to the minority class (oversampling) or removes examples from the majority class (undersampling) in order to create a more balanced data set. Both oversampling and undersampling have their strengths and drawbacks. In this work, we propose a hybrid sampling procedure that uses a combination of two sampling techniques to create a balanced data set. By using more than one sampling technique, we can combine the strengths of the individual techniques while lessening the drawbacks. We perform a comprehensive set of experiments, building more than one million classifiers, showing that our hybrid sampling procedure almost always outperforms the individual sampling techniques.
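To illustrate the general idea of a hybrid sampling procedure, the sketch below combines random undersampling of the majority class with random oversampling (duplication with replacement) of the minority class, meeting at a common target size. This is a minimal illustration of the concept, not the authors' exact procedure; the function name, the choice of plain duplication rather than synthetic example generation (e.g. SMOTE), and the "meet in the middle" target size are all assumptions made for the example.

```python
import random

def hybrid_sample(majority, minority, target_size=None, seed=0):
    """Balance a two-class data set by undersampling the majority class
    and oversampling the minority class to a common target size.

    This is an illustrative sketch: oversampling here is simple random
    duplication, and the default target is halfway between the two
    class sizes.
    """
    rng = random.Random(seed)
    if target_size is None:
        # Meet halfway between the two class sizes, so neither technique
        # has to do all the work (limiting information loss from
        # undersampling and overfitting risk from oversampling).
        target_size = (len(majority) + len(minority)) // 2
    # Undersample: draw majority examples without replacement.
    kept_majority = rng.sample(majority, target_size)
    # Oversample: draw minority examples with replacement.
    grown_minority = [rng.choice(minority) for _ in range(target_size)]
    return kept_majority, grown_minority

# Example: 100 majority vs. 10 minority examples.
maj = [("maj", i) for i in range(100)]
mino = [("min", i) for i in range(10)]
kept, grown = hybrid_sample(maj, mino)
```

After sampling, both classes contain 55 examples, so a classifier trained on the union sees a balanced class distribution while each individual technique only had to shift the distribution partway.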