An empirical study of the classification performance of learners on imbalanced and noisy software quality data

Authors:
Chris Seiffert; Taghi M. Khoshgoftaar; Jason Van Hulse; Andres Folleco
Affiliations:
Florida Atlantic University, Boca Raton, FL 33431, USA (all authors)
Venue:
Information Sciences: an International Journal
Year:
2014

Abstract

Data mining techniques are commonly used to construct models that identify the software modules most likely to contain faults. With such models, an organization's limited resources can be allocated intelligently, with the goal of detecting and correcting the greatest number of faults. However, two characteristics of software quality datasets can reduce the effectiveness of these models: class imbalance and class noise. Software quality datasets are, by their nature, imbalanced: most of a software system's faults are found in a small percentage of its modules. Therefore, the number of fault-prone (fp) examples (program modules) in a software project dataset is much smaller than the number of not fault-prone (nfp) examples. Data sampling techniques attempt to alleviate the problem of class imbalance by altering the class distribution of the training dataset. A program module contains class noise if its class label is incorrect. While several studies have evaluated data sampling methods, the impact of class noise on these techniques has not been adequately addressed. This work presents a systematic set of experiments designed to investigate the impact of both class noise and class imbalance on classification models constructed to identify fault-prone program modules. We analyze the impact of class noise and class imbalance on 11 different learning algorithms (learners) as well as 7 different data sampling techniques. We identify which learners and which data sampling techniques are most robust when confronted with noisy and imbalanced data.
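
The two data characteristics discussed in the abstract can be illustrated with a small Python sketch. It is not the authors' experimental procedure: the abstract does not name the seven sampling techniques or describe how noise was introduced. Random undersampling is used here only as one common example of a data sampling technique, and the function names `random_undersample` and `inject_class_noise`, the toy dataset, and the 10% noise rate are all assumptions made for illustration.

```python
import random
from collections import Counter


def random_undersample(examples, minority_label="fp", seed=0):
    """Randomly drop majority-class examples until both classes have equal size.

    Illustrative only: one common sampling technique, not necessarily one
    of the seven evaluated in the paper.
    """
    rng = random.Random(seed)
    minority = [e for e in examples if e[1] == minority_label]
    majority = [e for e in examples if e[1] != minority_label]
    kept = rng.sample(majority, min(len(minority), len(majority)))
    balanced = minority + kept
    rng.shuffle(balanced)
    return balanced


def inject_class_noise(examples, noise_rate, seed=0):
    """Flip the label (fp <-> nfp) of a randomly chosen fraction of examples.

    Hypothetical noise model for illustration; the paper's own procedure
    is not described in the abstract.
    """
    rng = random.Random(seed)
    n_flip = round(noise_rate * len(examples))
    flip_idx = set(rng.sample(range(len(examples)), n_flip))
    noisy = []
    for i, (features, label) in enumerate(examples):
        if i in flip_idx:
            label = "nfp" if label == "fp" else "fp"
        noisy.append((features, label))
    return noisy


# Toy dataset (assumed): 10 fp modules and 90 nfp modules with placeholder features.
data = [((i,), "fp") for i in range(10)] + [((i,), "nfp") for i in range(10, 100)]

balanced = random_undersample(data)
noisy = inject_class_noise(data, noise_rate=0.1)

balanced_counts = Counter(label for _, label in balanced)
flipped = sum(1 for a, b in zip(data, noisy) if a[1] != b[1])
print(balanced_counts)
print(flipped, "labels flipped")
```

In a study of this kind, noise would typically be injected into the training data before sampling, so that the sampling technique and the learner are both evaluated on mislabeled, imbalanced input; the sketch keeps the two steps separate so each can be inspected on its own.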