Variance analysis in software fault prediction models

  • Authors:
  • Yue Jiang, Jie Lin, Bojan Cukic, Tim Menzies

  • Affiliations:
  • Lane Department of Computer Science and Electrical Engineering, West Virginia University, Morgantown, WV (all authors)

  • Venue:
  • ISSRE '09: Proceedings of the 20th IEEE International Conference on Software Reliability Engineering
  • Year:
  • 2009


Abstract

Software fault prediction models play an important role in software quality assurance. They identify software subsystems (modules, components, classes, or files) which are likely to contain faults. These subsystems, in turn, receive additional resources for verification and validation activities. Fault prediction models are binary classifiers, typically developed using supervised learning techniques and trained either on a subset of fault data from the current project or on data from a similar past project. In practice, it is critical that such models provide reliable prediction performance on data not used in training. Variance is an important reliability indicator of software fault prediction models. However, variance is often ignored or barely mentioned in many published studies. In this paper, through the analysis of twelve data sets from a public software engineering repository from the perspective of variance, we explore the following five questions regarding fault prediction models: (1) Do different types of classification performance measures exhibit different variance? (2) Does the size of the data set imply more (or less) accurate prediction performance? (3) Does the size of the training subset impact the model's stability? (4) Do different classifiers consistently exhibit different performance in terms of the model's variance? (5) Are there differences between the variance from 1000 runs and from 10 runs of 10-fold cross validation experiments? Our results indicate that variance is a very important factor in understanding fault prediction models, and we recommend best practices for reporting variance in empirical software engineering studies.
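The experimental design the abstract alludes to in question (5) can be illustrated with a short sketch: repeat an independent 10-fold cross validation experiment many times and compare the spread of a performance measure across 10 versus 1000 repetitions. The sketch below is a minimal illustration under assumed choices, not the paper's exact setup: it uses a synthetic imbalanced data set (make_classification) in place of the twelve repository data sets, Gaussian naive Bayes as a stand-in classifier, and AUC as the single performance measure.

```python
# Minimal sketch: variance of repeated 10-fold cross validation.
# Assumptions (not from the paper): synthetic data, GaussianNB, AUC.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Imbalanced binary data, loosely mimicking fault data (faulty modules rare).
X, y = make_classification(n_samples=500, weights=[0.85, 0.15], random_state=0)

def repeated_cv_scores(n_runs):
    """Mean AUC of each of n_runs independent 10-fold CV experiments."""
    scores = []
    for run in range(n_runs):
        # A different shuffle seed per run makes each 10-fold split independent.
        cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=run)
        fold_auc = cross_val_score(GaussianNB(), X, y, cv=cv, scoring="roc_auc")
        scores.append(fold_auc.mean())
    return np.array(scores)

for n_runs in (10, 1000):
    s = repeated_cv_scores(n_runs)
    print(f"{n_runs:4d} runs: mean AUC = {s.mean():.3f}, "
          f"sample variance = {s.var(ddof=1):.5f}")
```

With more repetitions the variance estimate itself stabilizes, which is exactly the kind of contrast between 10-run and 1000-run experiments that question (5) asks about.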