Defect, defect, defect: defect prediction 2.0

  • Authors: Sunghun Kim
  • Affiliations: The Hong Kong University of Science and Technology
  • Venue: Proceedings of the 8th International Conference on Predictive Models in Software Engineering
  • Year: 2012

Abstract

Defect prediction has been a very active research area in software engineering [6-8, 11, 13, 16, 19, 20]. In 1971, Akiyama proposed one of the earliest defect prediction models using Lines of Code (LOC) [1]: "Defect = 4.86 + 0.018LOC." Since then, many effective new defect prediction models and metrics have been proposed.

For the prediction models, typical machine learners and regression algorithms such as Naive Bayes, Decision Tree, and Linear Regression are widely used. Beyond these general-purpose learners, Kim et al. proposed a cache-based prediction model using bug occurrence properties [9]. Hassan proposed a change entropy model to effectively predict defects [6]. Recently, Bettenburg et al. proposed Multivariate Adaptive Regression Splines to improve defect prediction models by learning from local and global properties together [4].

Besides LOC, many new effective metrics for defect prediction have been proposed. Among them, source code metrics and change history metrics are widely used and yield reasonable defect prediction accuracy. For example, Basili et al. [3] used Chidamber and Kemerer metrics, and Ohlsson et al. [14] used McCabe's cyclomatic complexity for defect prediction. Moser et al. [12] used the number of revisions, authors, and past fixes, and the age of a file as defect predictors. Recently, micro interaction metrics (MIMs) [10] and source code quality measures [15] have also been proposed for effective defect prediction.

However, much room for improvement remains for defect prediction 2.0. First of all, understanding the actual causes of defects is necessary. Without understanding them, we may reach nonsensical conclusions from defect prediction results [18]. Many effective prediction models have been proposed, but successful applications in practice are scarcely reported. To be more attractive to developers in practice, it is desirable to predict defects at finer granularity levels such as the code line or even keyword level.
Note that static bug finders such as FindBugs [2] can identify potential bugs at the line level, and many developers find them useful in practice. Dealing with noise in defect data has also become an important issue. Bird et al. found that there is non-negligible noise in defect data [5]. This noise may yield poor and/or meaningless defect prediction results.

Cross-project prediction is highly desirable: for new projects or projects with limited training data, it is necessary to learn a prediction model using sufficient training data from other projects, and to apply the model to those projects. However, Zimmermann et al. [21] found that cross-project prediction is a challenging problem. Turhan et al. [17] analyzed Cross-Company (CC) and Within-Company (WC) data for defect prediction, and confirmed that it is challenging to reuse CC data directly to predict defects in other companies' software.

Overall, defect prediction is a very interesting and promising research area. However, there are still many research challenges and problems to be addressed. Hopefully, this discussion calls for new solutions and ideas to address these challenges.
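
As a small illustration of the LOC-based model quoted in the abstract, the sketch below evaluates "Defect = 4.86 + 0.018LOC" for a few module sizes. The function name `akiyama_defects`, the non-negativity check, and the sample LOC values are assumptions made for this example only; they are not taken from the paper.

```python
def akiyama_defects(loc: float) -> float:
    """Estimate the number of defects from lines of code.

    Uses the linear model quoted in the abstract:
        Defect = 4.86 + 0.018 * LOC
    """
    if loc < 0:
        # Assumption for this sketch: a negative size is treated as invalid input.
        raise ValueError("LOC must be non-negative")
    return 4.86 + 0.018 * loc


if __name__ == "__main__":
    # Hypothetical module sizes, chosen only to show how the estimate scales.
    for loc in (0, 1_000, 10_000):
        print(f"{loc:>6} LOC -> {akiyama_defects(loc):.2f} predicted defects")
```

Because the model is linear in LOC, each additional 1,000 lines adds 18 predicted defects regardless of what the code does, which is one reason later work turned to richer source code and change history metrics.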