Much of the research literature in data mining and machine learning has focused on developing classification models for application-specific learning tasks. In contrast, the characteristics of the underlying data, and their impact on learning, have received far less attention. While it is generally understood that imbalanced, noisy, and relatively small datasets make classification more difficult, there has been, to our knowledge, no comprehensive examination of how these important and commonly encountered dataset characteristics affect the learning process. In this work, we present a comprehensive empirical analysis of learning from imbalanced, limited, and noisy data. We report the performance of 11 commonly used learning algorithms and the effects of dataset size, class distribution, noise level, and noise distribution on each learner. Across more than one million classification models, and using two different performance metrics, we identify which learners are most robust to changes in each of these experimental factors. Our results show that each factor plays a critical role in learner performance, with some learners exhibiting much greater stability than others.
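To make the experimental factors concrete, the sketch below shows one plausible way to control two of them, class distribution and noise, when preparing a training set: sample a dataset with a chosen minority-class fraction, then flip a controlled fraction of labels, directing a chosen share of the flips at the minority class to model the noise-distribution factor. This is an illustrative reconstruction, not the paper's actual procedure; all function and parameter names here (`sample_class_distribution`, `inject_class_noise`, `minority_share`, etc.) are assumptions.

```python
import random

def sample_class_distribution(pos, neg, n, pos_fraction, rng):
    """Draw a dataset of size n with the requested minority (positive) fraction.

    pos and neg are pools of examples for each class; rng is a random.Random.
    """
    n_pos = round(n * pos_fraction)
    data = rng.sample(pos, n_pos) + rng.sample(neg, n - n_pos)
    rng.shuffle(data)
    return data

def inject_class_noise(labels, noise_level, minority_share, rng):
    """Return a copy of binary labels with noise_level * len(labels) flips.

    minority_share of the flips target minority-class (label 1) examples,
    modelling a 'noise distribution' factor; the rest hit the majority class.
    """
    n_flip = round(len(labels) * noise_level)
    minority_idx = [i for i, y in enumerate(labels) if y == 1]
    majority_idx = [i for i, y in enumerate(labels) if y == 0]
    n_min = min(round(n_flip * minority_share), len(minority_idx))
    n_maj = min(n_flip - n_min, len(majority_idx))
    flipped = list(labels)
    for i in rng.sample(minority_idx, n_min) + rng.sample(majority_idx, n_maj):
        flipped[i] = 1 - flipped[i]  # flip 0 <-> 1
    return flipped
```

A study along the paper's lines would sweep `noise_level`, `minority_share`, the dataset size `n`, and `pos_fraction` over grids of values, retraining each learner at every setting and comparing the resulting performance metrics.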