Much of the research literature in data mining and machine learning has focused on developing classification models for application-specific learning tasks. In contrast, the characteristics of the underlying data, and their impact on learning, have received far less attention. While it is generally understood that imbalanced, noisy, and relatively small datasets make classification more difficult, there has been, to our knowledge, no comprehensive examination of how these important and commonly encountered dataset characteristics affect the learning process. In this work, we present a comprehensive empirical analysis of learning from imbalanced, limited, and noisy data. We report the performance of 11 commonly used learning algorithms and the effects of dataset size, class distribution, noise level, and noise distribution on each learner. Across more than one million classification models, and using two different performance metrics, we identify which learners are most robust to changes in each of these experimental factors. Our results show that each factor plays a critical role in learner performance, with some learners exhibiting much greater stability than others.
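To make the experimental factors concrete, the sketch below shows one plausible way to control two of them, class distribution and noise, when preparing a training set: sample a dataset with a chosen minority-class fraction, then flip a controlled fraction of labels, directing a chosen share of the flips at the minority class to model the noise-distribution factor. This is an illustrative reconstruction, not the paper's actual procedure; all function and parameter names here (`sample_class_distribution`, `inject_class_noise`, `minority_share`, etc.) are assumptions.

```python
import random

def sample_class_distribution(pos, neg, n, pos_fraction, rng):
    """Draw a dataset of size n with the requested minority (positive) fraction.

    pos and neg are pools of examples for each class; rng is a random.Random.
    """
    n_pos = round(n * pos_fraction)
    data = rng.sample(pos, n_pos) + rng.sample(neg, n - n_pos)
    rng.shuffle(data)
    return data

def inject_class_noise(labels, noise_level, minority_share, rng):
    """Return a copy of binary labels with noise_level * len(labels) flips.

    minority_share of the flips target minority-class (label 1) examples,
    modelling a 'noise distribution' factor; the rest hit the majority class.
    """
    n_flip = round(len(labels) * noise_level)
    minority_idx = [i for i, y in enumerate(labels) if y == 1]
    majority_idx = [i for i, y in enumerate(labels) if y == 0]
    n_min = min(round(n_flip * minority_share), len(minority_idx))
    n_maj = min(n_flip - n_min, len(majority_idx))
    flipped = list(labels)
    for i in rng.sample(minority_idx, n_min) + rng.sample(majority_idx, n_maj):
        flipped[i] = 1 - flipped[i]  # flip 0 <-> 1
    return flipped
```

A study along the paper's lines would sweep `noise_level`, `minority_share`, the dataset size `n`, and `pos_fraction` over grids of values, retraining each learner at every setting and comparing the resulting performance metrics.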