An empirical study of the classification performance of learners on imbalanced and noisy software quality data

  • Authors:
  • Chris Seiffert; Taghi M. Khoshgoftaar; Jason Van Hulse; Andres Folleco

  • Affiliations:
  • Florida Atlantic University, Boca Raton, FL 33431, USA (all four authors)

  • Venue:
  • Information Sciences: an International Journal
  • Year:
  • 2014

Abstract

Data mining techniques are commonly used to construct models for identifying the software modules that are most likely to contain faults. With such models, an organization's limited resources can be intelligently allocated with the goal of detecting and correcting the greatest number of faults. However, two characteristics of software quality datasets can negatively impact the effectiveness of these models: class imbalance and class noise. Software quality datasets are, by their nature, imbalanced. That is, most of a software system's faults can be found in a small percentage of its software modules. Therefore, the number of fault-prone (fp) examples (program modules) in a software project dataset is much smaller than the number of not fault-prone (nfp) examples. Data sampling techniques attempt to alleviate the problem of class imbalance by altering a training dataset's class distribution. A program module contains class noise if it is incorrectly labeled. While several studies have evaluated data sampling methods, the impact of class noise on these techniques has not been adequately addressed. This work presents a systematic set of experiments designed to investigate the impact of both class noise and class imbalance on classification models constructed to identify fault-prone program modules. We analyze the impact of class noise and class imbalance on 11 different learning algorithms (learners) as well as 7 different data sampling techniques. We identify which learners and which data sampling techniques are most robust when confronted with noisy and imbalanced data.
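
As an illustration of the two data characteristics discussed in the abstract, the following minimal Python sketch shows one way a training set's class distribution can be altered and one way class noise can be simulated. It is a sketch under stated assumptions, not the authors' experimental procedure: the abstract does not name the seven sampling techniques, so random undersampling is used here only as a simple, well-known example, and the function names, the toy dataset, and the number of flipped labels are all hypothetical.

```python
import random
from collections import Counter


def random_undersample(examples, majority_label="nfp", seed=0):
    """Randomly drop majority-class examples until both classes are the same size.

    Assumption: random undersampling stands in for "a data sampling technique";
    the abstract does not say which techniques the study evaluates.
    """
    rng = random.Random(seed)
    majority = [e for e in examples if e[1] == majority_label]
    minority = [e for e in examples if e[1] != majority_label]
    kept = rng.sample(majority, min(len(minority), len(majority)))
    return minority + kept


def inject_class_noise(examples, n_flips, seed=0):
    """Simulate class noise by flipping the label of n_flips random examples."""
    rng = random.Random(seed)
    flipped = set(rng.sample(range(len(examples)), n_flips))
    swap = {"fp": "nfp", "nfp": "fp"}
    return [
        (features, swap[label]) if i in flipped else (features, label)
        for i, (features, label) in enumerate(examples)
    ]


# Hypothetical toy dataset: 10 fp modules and 90 nfp modules.
# The integer "features" are placeholders for real software metrics.
data = [(i, "fp") for i in range(10)] + [(i, "nfp") for i in range(10, 100)]

balanced = random_undersample(data)
noisy = inject_class_noise(data, n_flips=5)

print(Counter(label for _, label in data))
print(Counter(label for _, label in balanced))
print(sum(1 for a, b in zip(data, noisy) if a[1] != b[1]), "labels flipped")
```

In this toy setting, undersampling changes only the training distribution, while noise injection changes labels without changing the examples themselves, which mirrors the distinction the abstract draws between class imbalance and class noise.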