On the importance of the validation technique for classification with imbalanced datasets: Addressing covariate shift when data is skewed

Authors:
Victoria López;Alberto Fernández;Francisco Herrera
Affiliations:
-;-;-
Venue:
Information Sciences: an International Journal
Year:
2014

Citing 37
Cited 0

Instance-Based Learning Algorithms

Machine Learning
Support-Vector Networks

Machine Learning
Learning in the presence of concept drift and hidden contexts

Machine Learning
MetaCost: a general method for making classifiers cost-sensitive

KDD '99 Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining
Learning and making decisions when costs and probabilities are both unknown

Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining
Fuzzy Algorithms: With Applications to Image Processing and Pattern Recognition

Fuzzy Algorithms: With Applications to Image Processing and Pattern Recognition
An Instance-Weighting Method to Induce Cost-Sensitive Trees

IEEE Transactions on Knowledge and Data Engineering
Cost-Sensitive Learning by Cost-Proportionate Example Weighting

ICDM '03 Proceedings of the Third IEEE International Conference on Data Mining
Editorial: special issue on learning from imbalanced data sets

ACM SIGKDD Explorations Newsletter - Special issue on learning from imbalanced datasets
Mining with rarity: a unifying framework

ACM SIGKDD Explorations Newsletter - Special issue on learning from imbalanced datasets
A study of the behavior of several methods for balancing machine learning training data

ACM SIGKDD Explorations Newsletter - Special issue on learning from imbalanced datasets
Classification and Modeling with Linguistic Information Granules: Advanced Approaches to Linguistic Data Mining (Advanced Information Processing)

Classification and Modeling with Linguistic Information Granules: Advanced Approaches to Linguistic Data Mining (Advanced Information Processing)
Using AUC and Accuracy in Evaluating Learning Algorithms

IEEE Transactions on Knowledge and Data Engineering
Data Mining: Practical Machine Learning Tools and Techniques, Second Edition (Morgan Kaufmann Series in Data Management Systems)

Data Mining: Practical Machine Learning Tools and Techniques, Second Edition (Morgan Kaufmann Series in Data Management Systems)
Evolutionary stratified training set selection for extracting classification rules with trade off precision-interpretability

Data & Knowledge Engineering
Statistical Comparisons of Classifiers over Multiple Data Sets

The Journal of Machine Learning Research
The class imbalance problem: A systematic study

Intelligent Data Analysis
Maximizing classifier utility when there are data acquisition and modeling costs

Data Mining and Knowledge Discovery
Evolutionary rule-based systems for imbalanced data sets

Soft Computing - A Fusion of Foundations, Methodologies and Applications - Special Issue on Evolutionary and Metaheuristics based Data Mining (EMBDM); Guest Editors: José A. Gámez, María J. del Jesús, José M. Puerta
KEEL: a software tool to assess evolutionary algorithms for data mining problems

Soft Computing - A Fusion of Foundations, Methodologies and Applications - Special Issue on Evolutionary and Metaheuristics based Data Mining (EMBDM); Guest Editors: José A. Gámez, María J. del Jesús, José M. Puerta
Dataset Shift in Machine Learning

Dataset Shift in Machine Learning
Handbook of Parametric and Nonparametric Statistical Procedures

Handbook of Parametric and Nonparametric Statistical Procedures
A study of statistical techniques and performance measures for genetics-based machine learning: accuracy and interpretability

Soft Computing - A Fusion of Foundations, Methodologies and Applications
Learning from Imbalanced Data

IEEE Transactions on Knowledge and Data Engineering
SMOTE: synthetic minority over-sampling technique

Journal of Artificial Intelligence Research
Learning when training data are costly: the effect of class distribution on tree induction

Journal of Artificial Intelligence Research
SVMs modeling for highly imbalanced classification

IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics - Special issue on human computing
A proposal on reasoning methods in fuzzy rule-based classification systems

International Journal of Approximate Reasoning
Facetwise analysis of XCS for problems with class imbalances

IEEE Transactions on Evolutionary Computation
The use of the area under the ROC curve in the evaluation of machine learning algorithms

Pattern Recognition
Assessing the impact of changing environments on classifier performance

Canadian AI'08 Proceedings of the Canadian Society for computational studies of intelligence, 21st conference on Advances in artificial intelligence
Combating the Small Sample Class Imbalance Problem Using Feature Selection

IEEE Transactions on Knowledge and Data Engineering
A unifying view on dataset shift in classification

Pattern Recognition
Analysis of preprocessing vs. cost-sensitive learning for imbalanced classification. Open problems on intrinsic data characteristics

Expert Systems with Applications: An International Journal
Support vector learning for fuzzy rule-based classification systems

IEEE Transactions on Fuzzy Systems
Rule Weight Specification in Fuzzy Rule-Based Classification Systems

IEEE Transactions on Fuzzy Systems
Nearest neighbor pattern classification

IEEE Transactions on Information Theory

Quantified Score

Hi-index	0.07

Visualization

Abstract

In the field of Data Mining, the estimation of the quality of the learned models is a key step in order to select the most appropriate tool for the problem to be solved. Traditionally, a k-fold validation technique has been carried out so that there is a certain degree of independency among the results for the different partitions. In this way, the highest average performance will be obtained by the most robust approach. However, applying a ''random'' division of the instances over the folds may result in a problem known as dataset shift, which consists in having a different data distribution between the training and test folds. In classification with imbalanced datasets, in which the number of instances of one class is much lower than the other class, this problem is more severe. The misclassification of minority class instances due to an incorrect learning of the real boundaries caused by a not well fitted data distribution, truly affects the measures of performance in this scenario. Regarding this fact, we propose the use of a specific validation technique for the partitioning of the data, known as ''Distribution optimally balanced stratified cross-validation'' to avoid this harmful situation in the presence of imbalance. This methodology makes the decision of placing close-by samples on different folds, so that each partition will end up with enough representatives of every region. We have selected a wide number of imbalanced datasets from KEEL dataset repository for our study, using several learning techniques from different paradigms, thus making the conclusions extracted to be independent of the underlying classifier. The analysis of the results has been carried out by means of the proper statistical study, which shows the goodness of this approach for dealing with imbalanced data.