Training and assessing classification rules with imbalanced data

Authors:
Giovanna Menardi; Nicola Torelli
Affiliations:
Dipartimento di Scienze Statistiche, Università degli Studi di Padova, Padova, Italy; Dipartimento di Scienze Economiche, Aziendali, Matematiche e Statistiche "Bruno de Finetti", Università degli Studi di Trieste, Trieste, Italy
Venue:
Data Mining and Knowledge Discovery
Year:
2014

Abstract

The problem of modeling binary responses from cross-sectional data has been addressed with a number of satisfactory solutions that draw on both parametric and nonparametric methods. However, in many real situations one of the two responses (usually the one of greater interest for the analysis) is rare. It has been widely reported that this class imbalance heavily compromises the learning process, because the model tends to focus on the prevalent class and to ignore the rare events. A skewed class distribution affects not only the estimation of the classification model but also the evaluation of its accuracy, because the scarcity of rare-class data leads to poor estimates of the model's accuracy. In this work, the effects of class imbalance on model training and model assessment are discussed. Moreover, a unified and systematic framework for dealing with imbalanced classification is proposed, based on a smoothed bootstrap re-sampling technique. The proposed technique rests on a sound theoretical basis, and an extensive empirical study shows that it outperforms the main alternative remedies for imbalanced learning problems.
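
The abstract does not spell out the re-sampling procedure, so the following is only a minimal sketch of a generic smoothed bootstrap for a two-class problem, not the authors' exact method. Each synthetic row is obtained by picking a class label with a chosen probability, drawing an observed row of that class at random, and perturbing it with kernel noise. The Gaussian kernel, the normal-reference bandwidth rule, the function name smoothed_bootstrap_balance, and the parameters p_minority and bandwidth_scale are all assumptions made for illustration.

```python
import numpy as np


def smoothed_bootstrap_balance(X, y, n_samples, p_minority=0.5,
                               bandwidth_scale=1.0, seed=0):
    """Draw a roughly balanced synthetic sample via a smoothed bootstrap.

    Illustrative sketch only: the kernel (Gaussian) and the bandwidth rule
    (normal reference, per class and per feature) are assumptions.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    classes = np.unique(y)
    if classes.size != 2:
        raise ValueError("exactly two classes are required")
    counts = np.array([(y == c).sum() for c in classes])
    minority_pos = int(np.argmin(counts))
    minority = classes[minority_pos]
    majority = classes[1 - minority_pos]

    # Step 1: draw the new labels, minority with probability p_minority.
    labels = np.array([majority, minority])
    y_new = rng.choice(labels, size=n_samples, p=[1.0 - p_minority, p_minority])

    n_features = X.shape[1]
    X_new = np.empty((n_samples, n_features))
    for c in labels:
        X_c = X[y == c]
        n_c = X_c.shape[0]
        target_rows = np.flatnonzero(y_new == c)

        # Assumed bandwidth: normal-reference rule scaled by each feature's
        # standard deviation within the class.
        factor = (4.0 / ((n_features + 2) * n_c)) ** (1.0 / (n_features + 4))
        h = bandwidth_scale * factor * X_c.std(axis=0)

        # Step 2: resample observed rows of this class with replacement.
        centers = X_c[rng.integers(0, n_c, size=target_rows.size)]
        # Step 3: add kernel noise around each resampled row.
        X_new[target_rows] = centers + rng.normal(size=centers.shape) * h

    return X_new, y_new
```

Setting bandwidth_scale to 0 reduces the sketch to a plain bootstrap (exact replicates of observed rows), which makes the role of the smoothing step easy to inspect. The same generator could in principle produce both training and test samples, but how the paper uses it for accuracy assessment is not stated in the abstract.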