Information Sciences: an International Journal
Data mining techniques are commonly used to build models that identify the software modules most likely to contain faults, so that an organization's limited resources can be allocated toward detecting and correcting the greatest number of faults. However, two characteristics of software quality datasets can reduce the effectiveness of these models: class imbalance and class noise. Software quality datasets are imbalanced by nature: most of a software system's faults are found in a small percentage of its modules, so the number of fault-prone (fp) examples (program modules) in a software project dataset is much smaller than the number of not fault-prone (nfp) examples. Data sampling techniques attempt to alleviate class imbalance by altering the class distribution of the training dataset. Class noise is present when a program module is incorrectly labeled. While several studies have evaluated data sampling methods, the impact of class noise on these techniques has not been adequately addressed. This work presents a systematic set of experiments designed to investigate the impact of both class noise and class imbalance on classification models constructed to identify fault-prone program modules. We analyze the impact of class noise and class imbalance on 11 different learning algorithms (learners) and 7 different data sampling techniques, and we identify which learners and which data sampling techniques are most robust when confronted with noisy and imbalanced data.
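As an illustration of the two data characteristics discussed above, the following minimal sketch shows one way a training set's class distribution might be altered by sampling and one way class noise might be simulated. It is not taken from the study: random undersampling is used only as a common, simple example of a data sampling technique, and the function names, the noise-injection scheme, and the 10/90 class split are assumptions made for illustration.

```python
import random

def random_undersample(labels, minority="fp", seed=0):
    """Return sorted indices of a balanced sample: every minority example
    plus an equally sized random subset of the majority examples.
    (Hypothetical helper, not the study's procedure.)"""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    n_keep = min(len(minority_idx), len(majority_idx))
    kept_majority = rng.sample(majority_idx, n_keep)
    return sorted(minority_idx + kept_majority)

def inject_class_noise(labels, rate, seed=0):
    """Return a copy of `labels` in which a randomly chosen fraction `rate`
    of the examples have their fp/nfp label flipped.
    (Hypothetical helper, assumed noise model.)"""
    rng = random.Random(seed)
    n_flip = int(round(rate * len(labels)))
    flip = set(rng.sample(range(len(labels)), n_flip))
    noisy = []
    for i, y in enumerate(labels):
        if i in flip:
            noisy.append("nfp" if y == "fp" else "fp")
        else:
            noisy.append(y)
    return noisy

# Assumed toy dataset: 10 fault-prone and 90 not fault-prone modules.
labels = ["fp"] * 10 + ["nfp"] * 90
balanced = [labels[i] for i in random_undersample(labels)]
noisy = inject_class_noise(labels, rate=0.1)
```

In this sketch the sampled set contains equal numbers of fp and nfp examples, while the noisy copy differs from the original labels in a fixed fraction of positions. A real experiment would apply such steps to the training data only and then compare learners on clean test data.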