Predicting high-risk program modules by selecting the right software measurements

  • Authors: Kehan Gao; Taghi M. Khoshgoftaar; Naeem Seliya
  • Affiliations: Eastern Connecticut State University, Willimantic, USA 06226; Florida Atlantic University, Boca Raton, USA 33431; University of Michigan-Dearborn, Dearborn, USA 48128
  • Venue: Software Quality Control
  • Year: 2012

Abstract

Timely detection of high-risk program modules in high-assurance software is critical for avoiding the severe consequences of operational failures. While software risk can originate from external sources, such as management or outsourcing, software quality is adversely affected when internal software risks are realized, such as improper practice of standard software processes or the lack of a defined software quality infrastructure. Practitioners employ various techniques to identify and rectify high-risk or low-quality program modules. How effectively such modules are detected depends on the software measurements used, which makes feature selection an important step in software quality prediction. We use a wrapper-based feature ranking technique to select the optimal set of software metrics for building defect prediction models. We also address the adverse effects of class imbalance (very few low-quality modules compared to high-quality modules), a practical problem observed in high-assurance systems. Applying a data sampling technique followed by feature selection is a relatively novel contribution of our work. We present a comprehensive investigation of the impact of data sampling followed by attribute selection on defect predictors built with imbalanced data. The case study data are obtained from several real-world high-assurance software projects. The key results are that attribute selection is more efficient when applied after data sampling, and that defect prediction performance generally improves after applying data sampling and feature selection.
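
The pipeline described in the abstract (data sampling first, then wrapper-based feature ranking) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration, not the authors' implementation: random undersampling stands in for whichever sampling technique the study used, a one-dimensional nearest-class-mean learner stands in for the wrapped classifier, AUC measured on the sampled data itself stands in for the performance metric, and the three synthetic metrics and all function names are hypothetical.

```python
import random


def undersample(rows, labels, seed=0):
    """Random undersampling (assumed technique): keep every minority-class
    module and an equally sized random subset of the majority class."""
    rng = random.Random(seed)
    faulty = [i for i, y in enumerate(labels) if y == 1]
    clean = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (faulty, clean) if len(faulty) <= len(clean) else (clean, faulty)
    kept = sorted(minority + rng.sample(majority, len(minority)))
    return [rows[i] for i in kept], [labels[i] for i in kept]


def auc(scores, labels):
    """Area under the ROC curve by pairwise comparison; ties count as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def single_feature_scores(values, labels):
    """Hypothetical one-feature learner: a higher score means the value lies
    closer to the faulty-class mean than to the clean-class mean."""
    mean_faulty = sum(v for v, y in zip(values, labels) if y == 1) / labels.count(1)
    mean_clean = sum(v for v, y in zip(values, labels) if y == 0) / labels.count(0)
    return [abs(v - mean_clean) - abs(v - mean_faulty) for v in values]


def wrapper_rank(rows, labels):
    """Wrapper-style ranking: fit the learner on each feature alone and rank
    features by the resulting AUC. For brevity the AUC is computed on the
    same data used for fitting; a real study would use cross-validation."""
    ranking = []
    for j in range(len(rows[0])):
        column = [r[j] for r in rows]
        ranking.append((j, auc(single_feature_scores(column, labels), labels)))
    return sorted(ranking, key=lambda item: item[1], reverse=True)


# Synthetic, imbalanced data: 4 faulty modules (label 1), 16 clean (label 0).
# Feature 0 separates the classes, feature 1 is constant, feature 2 is
# unrelated to the label.
rows = [[90 + 5 * k, 5, k % 2] for k in range(4)] + [[10 + k, 5, k % 2] for k in range(16)]
labels = [1] * 4 + [0] * 16

balanced_rows, balanced_labels = undersample(rows, labels)  # step 1: sampling
ranked = wrapper_rank(balanced_rows, balanced_labels)        # step 2: ranking
print(ranked)
```

In this sketch the ranking is computed on the balanced sample rather than on the original data, which mirrors the ordering the abstract reports as more effective; the top-ranked features would then be used to train the final defect predictor.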