Comparing the effectiveness of several modeling methods for fault prediction

  • Authors:
  • Elaine J. Weyuker; Thomas J. Ostrand; Robert M. Bell

  • Affiliations:
  • AT&T Labs - Research, Florham Park, USA 07932; AT&T Labs - Research, Florham Park, USA 07932; AT&T Labs - Research, Florham Park, USA 07932

  • Venue:
  • Empirical Software Engineering
  • Year:
  • 2010

Abstract

We compare the effectiveness of four modeling methods (negative binomial regression, recursive partitioning, random forests, and Bayesian additive regression trees) for predicting the files likely to contain the most faults for 28 to 35 releases of three large industrial software systems. Predictor variables included lines of code, file age, faults in the previous release, changes in the previous two releases, and programming language. To compare the effectiveness of the different models, we use two metrics: the percentage of faults contained in the top 20% of files identified by the model, and a new, more general metric, the fault-percentile-average. The negative binomial regression and random forests models performed significantly better than recursive partitioning and Bayesian additive regression trees, as assessed by either of the metrics. For each of the three systems, the negative binomial and random forests models identified 20% of the files in each release that contained an average of 76% to 94% of the faults.
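
The two evaluation metrics named in the abstract can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code. The abstract does not define the fault-percentile-average, so the sketch assumes it is the mean, over every cutoff m from 1 to the number of files, of the share of actual faults found in the m files ranked highest by the model. The function names and the toy data are hypothetical.

```python
def rank_by_prediction(predicted):
    """Return file indices ordered from highest to lowest predicted fault count."""
    return sorted(range(len(predicted)), key=lambda i: predicted[i], reverse=True)


def top_fraction_fault_share(predicted, actual, fraction=0.2):
    """Share of actual faults contained in the top `fraction` of ranked files."""
    order = rank_by_prediction(predicted)
    cutoff = max(1, round(fraction * len(order)))  # at least one file is selected
    total = sum(actual)
    if total == 0:
        return 0.0
    return sum(actual[i] for i in order[:cutoff]) / total


def fault_percentile_average(predicted, actual):
    """Assumed definition: average, over all cutoffs m, of the fault share
    captured by the top m ranked files."""
    order = rank_by_prediction(predicted)
    total = sum(actual)
    if total == 0 or not order:
        return 0.0
    running = 0       # faults captured so far while walking down the ranking
    share_sum = 0.0   # sum of captured shares across all cutoffs
    for i in order:
        running += actual[i]
        share_sum += running / total
    return share_sum / len(order)


# Toy data (hypothetical): predicted and actual fault counts for five files.
predicted = [5, 1, 3, 0, 2]
actual = [4, 0, 1, 0, 0]
top20 = top_fraction_fault_share(predicted, actual)
fpa = fault_percentile_average(predicted, actual)
```

Under this assumed definition, the top-20% metric looks at a single cutoff, while the fault-percentile-average summarizes the ranking across all cutoffs, which matches the abstract's description of it as the more general of the two.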