An investigation on the feasibility of cross-project defect prediction

  • Authors:
  • Zhimin He, Fengdi Shu, Ye Yang, Mingshu Li, Qing Wang

  • Affiliations:
  • Laboratory for Internet Software Technologies, Institute of Software, Chinese Academy of Sciences, Beijing, China 100190 (all authors); Graduate University, Chinese Academy of Sciences, Beijing, China 100190 (first author); State Key Laboratory of Computer Science, Institute of Software, Chinese ... (fourth author)

  • Venue:
  • Automated Software Engineering
  • Year:
  • 2012


Abstract

Software defect prediction helps optimize the allocation of testing resources by identifying defect-prone modules prior to testing. Most existing models build their prediction capability from a set of historical data, presumably drawn from the same or similar project settings as those under prediction. However, such historical data is not always available in practice. One potential way to predict defects in projects without historical data is to learn predictors from the data of other projects. This paper investigates defect prediction in the cross-project context, focusing on the selection of training data. We conduct three large-scale experiments on 34 data sets obtained from 10 open source projects. The major conclusions from our experiments are: (1) in the best cases, training data from other projects can provide better prediction results than training data from the same project; (2) prediction results obtained using training data from other projects meet our acceptance criteria at the average level: in 18 out of 34 cases, defects were predicted with Recall greater than 70% and Precision greater than 50%; (3) the results of cross-project defect prediction are related to the distributional characteristics of the data sets, which is valuable information for training data selection. We further propose an approach to automatically select suitable training data for projects without historical data. The prediction results obtained with training data selected by our approach are comparable with those obtained with training data from the same project.
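
Conclusion (3) and the proposed approach suggest choosing a source project whose data sets have distributional characteristics close to the target project's. The abstract does not specify which statistics or distance measure are used, so the sketch below is only illustrative, not the authors' method: it summarizes each project's metric matrix with simple per-metric statistics (mean, standard deviation, median, all assumed here for illustration) and selects the candidate with the smallest Euclidean distance to the target.

```python
# Illustrative sketch of distribution-based training data selection.
# The summary statistics and the Euclidean distance are assumptions;
# the paper's actual selection procedure is not given in the abstract.
import numpy as np

def characteristic_vector(metrics):
    """Summarize a project's metric matrix (rows = modules, cols = metrics)."""
    return np.concatenate([metrics.mean(axis=0),
                           metrics.std(axis=0),
                           np.median(metrics, axis=0)])

def select_training_project(target_metrics, candidate_sets):
    """Return the candidate project whose distributional characteristics
    are closest (in Euclidean distance) to the target project's."""
    target_vec = characteristic_vector(target_metrics)
    best_name, best_dist = None, float("inf")
    for name, metrics in candidate_sets.items():
        dist = np.linalg.norm(characteristic_vector(metrics) - target_vec)
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name

# Example usage with random placeholder data (hypothetical project names):
rng = np.random.default_rng(0)
target = rng.random((50, 20))                      # 50 modules, 20 static metrics
candidates = {"projA": rng.random((80, 20)),
              "projB": rng.random((60, 20))}
print(select_training_project(target, candidates))
```

The selected candidate would then serve as the training set for a conventional defect predictor on the target project, which has no historical data of its own.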