Variance Analysis in Software Fault Prediction Models

  • Authors:
  • Yue Jiang; Jie Lin; Bojan Cukic; Tim Menzies

  • Venue:
  • ISSRE '09 Proceedings of the 2009 20th International Symposium on Software Reliability Engineering
  • Year:
  • 2009

Abstract

Software fault prediction models play an important role in software quality assurance. They identify software subsystems (modules, components, classes, or files) that are likely to contain faults. These subsystems, in turn, receive additional resources for verification and validation activities. Fault prediction models are binary classifiers, typically developed using a supervised learning technique from either a subset of the fault data from the current project or from a similar past project. In practice, it is critical that such models provide reliable prediction performance on data not used in training. Variance is an important reliability indicator of software fault prediction models. However, variance is often ignored or barely mentioned in many published studies. In this paper, through the analysis of twelve data sets from a public software engineering repository from the perspective of variance, we explore the following five questions regarding fault prediction models: (1) Do different types of classification performance measures exhibit different variance? (2) Does the size of the data set imply more (or less) accurate prediction performance? (3) Does the size of the training subset impact the model's stability? (4) Do different classifiers consistently exhibit different performance in terms of the model's variance? (5) Are there differences between the variance from 1000 runs and from 10 runs of 10-fold cross-validation experiments? Our results indicate that variance is a very important factor in understanding fault prediction models, and we recommend best practices for reporting variance in empirical software engineering studies.
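
The sketch below is a minimal illustration of the kind of experiment the abstract describes: repeating 10-fold cross-validation many times and reporting the variance of a performance measure across runs. The synthetic data set, the naive Bayes classifier, and AUC as the performance measure are assumptions made for the example, not the paper's actual data sets, learners, or measures.

```python
# Illustrative sketch only: data set, classifier, and performance measure
# are stand-ins, not the paper's experimental setup.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for a module-level fault data set
# (features ~ static code metrics, label = fault-prone or not).
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.85, 0.15], random_state=0)

def cv_run_means(n_runs):
    """Mean AUC of each of n_runs independent 10-fold cross-validation runs."""
    means = []
    for run in range(n_runs):
        cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=run)
        scores = cross_val_score(GaussianNB(), X, y, cv=cv, scoring="roc_auc")
        means.append(scores.mean())
    return np.array(means)

# Contrast the spread observed after 10 runs with that after 1000 runs,
# mirroring the comparison raised in question (5) of the abstract.
for n_runs in (10, 1000):
    aucs = cv_run_means(n_runs)
    print(f"{n_runs:5d} runs: mean AUC = {aucs.mean():.3f}, "
          f"variance = {aucs.var(ddof=1):.6f}")
```

Reporting the across-run variance alongside the mean, as in the last loop, is one way to follow the abstract's recommendation that empirical studies report variance rather than a single point estimate of prediction performance.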