Software quality estimation with limited fault data: a semi-supervised learning perspective

  • Authors:
  • Naeem Seliya;Taghi M. Khoshgoftaar

  • Affiliations:
  • Computer and Information Science, University of Michigan-Dearborn, Dearborn, USA 48128; Computer Science and Engineering, Florida Atlantic University, Boca Raton, USA 33431

  • Venue:
  • Software Quality Control
  • Year:
  • 2007

Abstract

We address the important problem of software quality analysis when there is limited software fault or fault-proneness data. A software quality model is typically trained using software measurement and fault data obtained from a previous release or similar project. Such an approach assumes that fault data is available for all the training modules. Various issues in software development may limit the availability of fault-proneness data for all the training modules. Consequently, the available labeled training dataset is such that the trained software quality model may not provide reliable predictions. More specifically, the small set of modules with known fault-proneness labels is not sufficient for capturing the software quality trends of the project. We investigate semi-supervised learning with the Expectation Maximization (EM) algorithm for software quality estimation with limited fault-proneness data. The hypothesis is that knowledge stored in software attributes of the unlabeled program modules will aid in improving software quality estimation. Software data collected from a large NASA software project is used during the semi-supervised learning process. The software quality model is evaluated with multiple test datasets collected from other NASA software projects. Compared to software quality models trained only with the available set of labeled program modules, the EM-based semi-supervised learning scheme improves generalization performance of the software quality models.
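
The general idea of EM-based semi-supervised learning can be illustrated with a minimal sketch. The abstract does not specify the underlying classifier or the software metrics used, so everything below is an assumption made for illustration: a two-class Gaussian naive Bayes model, two hypothetical metrics per module (lines of code and cyclomatic complexity), and made-up toy values rather than NASA project data. The function names (`m_step`, `posterior`, `em_semi_supervised`) are likewise hypothetical. The sketch fits the model on the labeled modules, then alternates between estimating class probabilities for the unlabeled modules (E-step) and re-fitting on all modules with those soft labels (M-step).

```python
import math

def gaussian_logpdf(x, mean, var):
    # Log density of a univariate normal distribution.
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def m_step(X, W, n_classes=2, min_var=1e-3):
    # Re-estimate (prior, means, variances) per class from rows X,
    # where W[i][c] is the weight of row i for class c.
    n_features = len(X[0])
    total = sum(sum(w) for w in W)
    params = []
    for c in range(n_classes):
        wc = sum(w[c] for w in W)
        means = [sum(w[c] * x[j] for x, w in zip(X, W)) / wc
                 for j in range(n_features)]
        variances = [max(sum(w[c] * (x[j] - means[j]) ** 2
                             for x, w in zip(X, W)) / wc, min_var)
                     for j in range(n_features)]
        params.append((wc / total, means, variances))
    return params

def posterior(x, params):
    # Class membership probabilities for one module (naive Bayes).
    logs = [math.log(prior) + sum(gaussian_logpdf(xj, m, v)
                                  for xj, m, v in zip(x, means, variances))
            for prior, means, variances in params]
    top = max(logs)  # subtract the maximum for numerical stability
    exps = [math.exp(l - top) for l in logs]
    s = sum(exps)
    return [e / s for e in exps]

def em_semi_supervised(labeled_X, labeled_y, unlabeled_X, n_iter=20):
    # Labeled modules keep fixed (hard) weights throughout.
    hard = [[1.0 if y == c else 0.0 for c in range(2)] for y in labeled_y]
    params = m_step(labeled_X, hard)  # initial model: labeled data only
    for _ in range(n_iter):
        soft = [posterior(x, params) for x in unlabeled_X]      # E-step
        params = m_step(labeled_X + unlabeled_X, hard + soft)   # M-step
    return params

# Toy data (illustrative values only): [lines of code, cyclomatic complexity].
# Label 0 = not fault-prone, 1 = fault-prone.
labeled_X = [[10, 2], [14, 3], [60, 12], [70, 15]]
labeled_y = [0, 0, 1, 1]
unlabeled_X = [[12, 2], [11, 3], [15, 4], [65, 13], [58, 11], [72, 14]]

params = em_semi_supervised(labeled_X, labeled_y, unlabeled_X)
print(posterior([13, 3], params))   # probabilities for a small module
print(posterior([66, 13], params))  # probabilities for a large module
```

In this sketch the unlabeled modules influence the class means, variances, and priors through their soft labels, which is one way the attributes of unlabeled modules can contribute to the model. Other EM-based schemes, for example ones that add only confidently labeled modules to the training set, are equally compatible with the description in the abstract.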