Uncertain Classification of Fault-Prone Software Modules

  • Authors:
  • Taghi M. Khoshgoftaar;Xiaojing Yuan;Edward B. Allen;Wendell D. Jones;John P. Hudepohl

  • Affiliations:
  • Empirical Software Engineering Laboratory, Department of Computer Science and Engineering, Florida Atlantic University, Boca Raton, FL 33431, USA/ taghi@cse.fau.edu;Empirical Software Engineering Laboratory, Department of Computer Science and Engineering, Florida Atlantic University, Boca Raton, FL 33431, USA/ xyuan@cse.fau.edu;Mississippi State University, Mississippi, USA/ edward.allen@computer.org;Nortel Networks, Research Triangle Park, North Carolina, USA/ wjones@asciences.com;Nortel Networks, Research Triangle Park, North Carolina, USA/ hudepohl@nortelnetworks.com

  • Venue:
  • Empirical Software Engineering
  • Year:
  • 2002

Quantified Score

Hi-index 0.00

Visualization

Abstract

Many development organizations try to minimize faults in software as a means for improving customer satisfaction. Assuring high software quality often entails time-consuming and costly development processes. A software quality model based on software metrics can be used to guide enhancement efforts by predicting which modules are fault-prone. This paper presents statistical techniques to determine which predictions by a classification tree should be considered uncertain. We conducted a case study of a large legacy telecommunications system. One release was the basis for the training dataset, and the subsequent release was the basis for the evaluation dataset. We built a classification tree using the TREEDISC algorithm, which is based on χ2 tests of contingency tables. The model predicted whether a module was likely to have faults discovered by customers, or not, based on software product, process, and execution metrics. We simulated practical use of the model by classifying the modules in the evaluation dataset. The model achieved useful accuracy, in spite of the very small proportion of fault-prone modules in the system. We assessed whether the classes assigned to the leaves were appropriate by statistical tests, and found sizable subsets of modules with uncertain classification. Discovering which modules have uncertain classifications allows sophisticated enhancement strategies to resolve uncertainties. Moreover, TREEDISC is especially well suited to identifying uncertain classifications.