Revisiting the evaluation of defect prediction models

  • Authors:
  • Thilo Mende; Rainer Koschke

  • Affiliations:
  • University of Bremen, Germany; University of Bremen, Germany

  • Venue:
  • PROMISE '09: Proceedings of the 5th International Conference on Predictor Models in Software Engineering
  • Year:
  • 2009

Abstract

Defect prediction models aim at identifying error-prone parts of a software system as early as possible. Many such models have been proposed, but their evaluation, as recent publications show, is still an open question. An important aspect often ignored during evaluation is the effort reduction gained by using such models. Models are usually evaluated per module with performance measures from information retrieval, such as recall, precision, or the area under the ROC curve (AUC). These measures assume that the costs associated with additional quality assurance activities are the same for each module, which is not realistic in practice: costs for unit testing and code reviews, for example, are roughly proportional to the size of a module. In this paper, we investigate this discrepancy using optimal and trivial models. We describe a trivial model that takes only the module size, measured in lines of code, into account, and compare it to five classification methods. The trivial model performs surprisingly well when evaluated with AUC. However, when an effort-sensitive performance measure is used, it becomes apparent that the trivial model is in fact the worst.
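
To make the discrepancy concrete, the following Python sketch (not taken from the paper; the module sizes, defect labels, and the 20% inspection budget are invented for illustration) scores modules with a trivial LOC-based model, computes its module-level AUC, and then takes an effort-aware view by tracking how many defective modules are found per fraction of lines of code inspected, assuming QA cost is proportional to module size.

```python
# Minimal sketch contrasting module-based AUC with an effort-aware view.
# Assumptions (not from the paper): synthetic module sizes and defect
# labels; the "trivial" model scores each module by its LOC.

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 200
loc = rng.integers(20, 2000, size=n)                  # module size in LOC
# larger modules are given a higher chance of being defective
defective = rng.random(n) < np.clip(loc / 4000, 0.02, 0.6)

trivial_score = loc.astype(float)                     # trivial LOC-based model

# 1) Conventional, module-based evaluation: AUC treats every module equally.
print("AUC of trivial model:", roc_auc_score(defective, trivial_score))

# 2) Effort-aware view: inspect modules in score order and track the
#    fraction of defective modules found against the fraction of LOC
#    inspected (QA cost assumed proportional to module size).
order = np.argsort(-trivial_score)
effort = np.cumsum(loc[order]) / loc.sum()
found = np.cumsum(defective[order]) / defective.sum()
idx = np.searchsorted(effort, 0.2)                    # 20% LOC budget
print("Defects found at 20% of LOC:", found[idx])
```

Because the trivial model ranks the largest modules first, it can achieve a high AUC while consuming most of the inspection budget on very few modules, which is exactly the effect an effort-sensitive measure penalizes.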