Analysis of preprocessing vs. cost-sensitive learning for imbalanced classification. Open problems on intrinsic data characteristics

Authors:
Victoria López; Alberto Fernández; Jose G. Moreno-Torres; Francisco Herrera
Affiliations:
Dept. of Computer Science and Artificial Intelligence, CITIC-UGR (Research Center on Information and Communications Technology), University of Granada, 18071 Granada, Spain; Dept. of Computer Science, University of Jaén, 23071 Jaén, Spain; Dept. of Computer Science and Artificial Intelligence, CITIC-UGR (Research Center on Information and Communications Technology), University of Granada, 18071 Granada, Spain; Dept. of Computer Science and Artificial Intelligence, CITIC-UGR (Research Center on Information and Communications Technology), University of Granada, 18071 Granada, Spain
Venue:
Expert Systems with Applications: An International Journal
Year:
2012

Abstract

Class imbalance is among the most persistent complications confronting supervised learning in real-world applications. In the binary case, the problem occurs when the instances of one class significantly outnumber those of the other. This is a handicap when trying to identify the minority class, because learning algorithms are not usually adapted to such distributions. Approaches to imbalanced datasets fall into two major categories: data sampling and algorithmic modification. Cost-sensitive learning solutions, which incorporate both data-level and algorithm-level approaches, assume higher misclassification costs for samples in the minority class and seek to minimize high-cost errors. Nevertheless, there is no exhaustive comparison between these models that can help determine the most appropriate one under different scenarios. The main objective of this work is to analyze the performance of data-level proposals against algorithm-level proposals, focusing on cost-sensitive models, and against a hybrid procedure that combines the two approaches. We show, by means of a statistical comparative analysis, that no single approach stands out from the rest. This leads to a discussion of the intrinsic data characteristics of the imbalanced classification problem, which points to new paths for improving current models, mainly focusing on class overlap and dataset shift in imbalanced classification.
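
To make the two families named in the abstract concrete, the following is a minimal Python sketch and not the authors' experimental setup. It contrasts a data-level step (random oversampling of the minority class) with an algorithm-level, cost-sensitive step (a minimum-expected-cost decision rule). The function names, the toy data, and the choice of setting the false-negative cost equal to the imbalance ratio are all assumptions made for illustration.

```python
import random
from collections import Counter


def imbalance_ratio(y):
    """Ratio of majority-class count to minority-class count."""
    counts = Counter(y)
    return max(counts.values()) / min(counts.values())


def random_oversample(X, y, seed=0):
    """Data-level approach (illustrative): duplicate randomly chosen
    instances of each smaller class until it matches the largest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out


def min_cost_label(p_pos, cost_fn, cost_fp):
    """Algorithm-level approach (illustrative): predict the positive
    (minority) class when predicting negative has the higher expected cost.

    Expected cost of predicting negative: p_pos * cost_fn
    Expected cost of predicting positive: (1 - p_pos) * cost_fp
    """
    return 1 if p_pos * cost_fn > (1 - p_pos) * cost_fp else 0


# Hypothetical toy data: nine majority instances (0), one minority instance (1).
X = list(range(10))
y = [0] * 9 + [1]

ir = imbalance_ratio(y)
X_bal, y_bal = random_oversample(X, y)

# Assumption: use the imbalance ratio as the false-negative cost.
label_at_20_percent = min_cost_label(0.20, cost_fn=ir, cost_fp=1.0)
label_at_5_percent = min_cost_label(0.05, cost_fn=ir, cost_fp=1.0)
```

With equal costs the rule reduces to the usual 0.5 threshold. Raising the false-negative cost lowers the probability at which the minority class is predicted, which is the effect oversampling aims for by changing the training distribution instead.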