Software fault prediction models play an important role in software quality assurance. They identify software subsystems (modules, components, classes, or files) that are likely to contain faults; these subsystems, in turn, receive additional resources for verification and validation activities. Fault prediction models are binary classifiers, typically developed with a supervised learning technique from either a subset of the fault data of the current project or data from a similar past project. In practice, it is critical that such models provide reliable prediction performance on data not used in training. Variance is an important indicator of the reliability of software fault prediction models, yet it is often ignored or barely mentioned in published studies. In this paper, by analyzing twelve data sets from a public software engineering repository from the perspective of variance, we explore five questions regarding fault prediction models: (1) Do different types of classification performance measures exhibit different variance? (2) Does the size of the data set imply more (or less) accurate prediction performance? (3) Does the size of the training subset affect a model's stability? (4) Do different classifiers consistently differ in the variance of their models? (5) Are there differences between the variance observed over 1,000 runs and over 10 runs of 10-fold cross-validation experiments? Our results indicate that variance is a very important factor in understanding fault prediction models, and we recommend best practices for reporting variance in empirical software engineering studies.
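The following is a minimal sketch, not the authors' experimental setup, of how variance across repeated 10-fold cross-validation runs can be estimated for a fault prediction model. The synthetic data set, the random forest classifier, the AUC metric, and the repeat counts (10 vs. 100 here, rather than 1,000) are all illustrative assumptions.

```python
# Sketch: variance of a fault prediction model's performance over repeated
# 10-fold cross-validation. Data, classifier, and metric are assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Stand-in for a repository data set: static code attributes with an
# imbalanced binary fault label (most modules are fault-free).
X, y = make_classification(n_samples=500, n_features=20,
                           weights=[0.85, 0.15], random_state=0)

def repeated_cv_scores(n_repeats, n_splits=10):
    """Return one mean AUC per repeat of n_splits-fold cross-validation."""
    scores = []
    for seed in range(n_repeats):
        cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
        clf = RandomForestClassifier(n_estimators=100, random_state=seed)
        fold_auc = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")
        scores.append(fold_auc.mean())
    return np.array(scores)

# Contrast the spread of results from few vs. many repeated CV runs.
for n_repeats in (10, 100):
    s = repeated_cv_scores(n_repeats)
    print(f"{n_repeats:4d} repeats: mean AUC = {s.mean():.3f}, "
          f"variance = {s.var(ddof=1):.5f}")
```

Reporting the variance (or a confidence interval) alongside the mean of such repeated runs, rather than a single cross-validation estimate, is the kind of practice the paper argues for.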