On the effectiveness of preprocessing methods when dealing with different levels of class imbalance

Authors:
V. García;J. S. Sánchez;R. A. Mollineda
Affiliations:
Institute of New Imaging Technologies, Dept. Llenguatges i Sistemes Informàtics, Universitat Jaume I, Av. Sos Baynat s/n, 12071 Castelló de la Plana, Spain; Institute of New Imaging Technologies, Dept. Llenguatges i Sistemes Informàtics, Universitat Jaume I, Av. Sos Baynat s/n, 12071 Castelló de la Plana, Spain; Institute of New Imaging Technologies, Dept. Llenguatges i Sistemes Informàtics, Universitat Jaume I, Av. Sos Baynat s/n, 12071 Castelló de la Plana, Spain
Venue:
Knowledge-Based Systems
Year:
2012

Abstract

This paper investigates how the imbalance ratio and the choice of classifier affect the performance of several resampling strategies for dealing with imbalanced data sets. The study evaluates how learning is affected when different resampling algorithms transform the originally imbalanced data into artificially balanced class distributions. Experiments on 17 real data sets using eight different classifiers, four resampling algorithms and four performance evaluation measures show that over-sampling the minority class consistently outperforms under-sampling the majority class when data sets are strongly imbalanced, whereas there are no significant differences for data sets with low imbalance. The results also indicate that the choice of classifier has very little influence on the effectiveness of the resampling strategies.
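To make the two families of resampling compared in the abstract concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' method: the abstract does not name the four resampling algorithms, so plain random over-sampling and random under-sampling stand in for them. The function names and the toy data are hypothetical. The imbalance ratio is taken here as the majority-class count divided by the minority-class count.

```python
import random
from collections import Counter


def imbalance_ratio(labels):
    """Majority-class count divided by minority-class count (assumed definition)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples of smaller classes up to the largest class size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, n in counts.items():
        pool = [s for s, l in zip(samples, labels) if l == cls]
        for _ in range(target - n):
            out_samples.append(rng.choice(pool))
            out_labels.append(cls)
    return out_samples, out_labels


def random_undersample(samples, labels, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    out_samples, out_labels = [], []
    for cls in counts:
        pool = [s for s, l in zip(samples, labels) if l == cls]
        for s in rng.sample(pool, target):
            out_samples.append(s)
            out_labels.append(cls)
    return out_samples, out_labels


# Hypothetical toy data: 9 majority samples (label 0), 3 minority samples (label 1).
X = [[float(i)] for i in range(12)]
y = [0] * 9 + [1] * 3

X_over, y_over = random_oversample(X, y)
X_under, y_under = random_undersample(X, y)
print(imbalance_ratio(y), imbalance_ratio(y_over), imbalance_ratio(y_under))
```

Both transformations produce a balanced class distribution, but over-sampling keeps every original sample (18 in total here) while under-sampling discards majority samples (6 remain). With a strongly imbalanced data set, under-sampling leaves very little training data, which is one plausible reading of why the abstract reports over-sampling as the better choice in that regime.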