Improving software-quality predictions with data sampling and boosting

  • Authors:
  • Chris Seiffert;Taghi M. Khoshgoftaar;Jason Van Hulse

  • Affiliations:
  • Florida Atlantic University, Boca Raton, FL;Florida Atlantic University, Boca Raton, FL;Florida Atlantic University, Boca Raton, FL

  • Venue:
  • IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans
  • Year:
  • 2009

Abstract

Software-quality data sets tend to fall victim to the class-imbalance problem that plagues so many other application domains. The majority of faults in a software system, particularly high-assurance systems, usually lie in a very small percentage of the software modules. This imbalance between the number of fault-prone (fp) and non-fp (nfp) modules can have a severely negative impact on a data-mining technique's ability to differentiate between the two. This paper addresses the class-imbalance problem as it pertains to the domain of software-quality prediction. We present a comprehensive empirical study examining two different methodologies, data sampling and boosting, for improving the performance of decision-tree models designed to identify fp software modules. This paper applies five data-sampling techniques and boosting to 15 software-quality data sets of different sizes and levels of imbalance. Nearly 50 000 models were built for the experiments contained in this paper. Our results show that while data-sampling techniques are very effective in improving the performance of such models, boosting almost always outperforms even the best data-sampling techniques. This significant result, which, to our knowledge, has not been previously reported, has important consequences for practitioners developing software-quality classification models.
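The abstract contrasts two remedies for class imbalance: rebalancing the training data before fitting a decision tree, and boosting the tree learner directly. The sketch below is not the paper's experimental setup; it is a minimal illustration assuming Python with scikit-learn and imbalanced-learn, where synthetic data and SMOTE stand in for the software-quality data sets and the five sampling techniques studied in the paper.

```python
# Minimal sketch: data sampling vs. boosting on an imbalanced binary task.
# Synthetic data and SMOTE are stand-ins, not the authors' data or methods.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from imblearn.over_sampling import SMOTE

# Imbalanced data: roughly 5% minority ("fault-prone") examples.
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Data sampling: oversample the minority class, then fit a plain decision tree.
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_tr, y_tr)
tree = DecisionTreeClassifier(random_state=0).fit(X_bal, y_bal)

# Boosting: AdaBoost over shallow decision trees on the original imbalanced data.
boost = AdaBoostClassifier(DecisionTreeClassifier(max_depth=3),
                           n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Compare with a threshold-independent metric suited to imbalanced data.
for name, model in [("sampling + tree", tree), ("boosting", boost)]:
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name}: AUC = {auc:.3f}")
```

Evaluating with AUC rather than accuracy matters here: with only a few percent fault-prone modules, a model that predicts every module as non-fault-prone scores high accuracy while being useless for the task the paper addresses.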