Training data selection for cross-project defect prediction

Authors: Steffen Herbold
Affiliations: University of Göttingen, Göttingen, Germany
Venue: Proceedings of the 9th International Conference on Predictive Models in Software Engineering
Year: 2013

Abstract

Software defect prediction has been a popular research topic in recent years and is considered a means of optimizing quality assurance activities. Defect prediction can be performed in a within-project or a cross-project scenario. The within-project scenario produces high-quality results but requires historical data from the project itself, which is often not available. For cross-project prediction, data availability is not an issue, as data from other projects is readily available, e.g., in repositories like PROMISE. However, the quality of cross-project defect prediction results is too low for practical use. Recent research has shown that selecting appropriate training data can improve the quality of cross-project defect predictions. In this paper, we propose distance-based strategies for selecting training data based on distributional characteristics of the available data. We evaluate the proposed strategies in a large case study with 44 data sets obtained from 14 open source projects. Our results show that our training data selection strategy significantly improves the achieved success rate of cross-project defect predictions. However, the quality of the results still cannot compete with within-project defect prediction.
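
To illustrate the general idea of distance-based training data selection, the following is a minimal sketch and not the paper's actual procedure. The abstract does not state which distributional characteristics, distance measure, or selection rule are used, so everything here is an assumption: each data set is summarized by the per-metric mean and standard deviation, distances are Euclidean, and the k nearest candidate data sets are selected. The function names, project names, and toy values are hypothetical.

```python
from math import dist
from statistics import mean, pstdev


def characteristic_vector(rows):
    """Summarize a data set by the mean and standard deviation of each metric.

    Assumption: mean and standard deviation stand in for the
    "distributional characteristics" mentioned in the abstract.
    """
    columns = list(zip(*rows))  # one tuple of values per metric
    return [mean(c) for c in columns] + [pstdev(c) for c in columns]


def select_training_sets(target_rows, candidates, k=2):
    """Return the names of the k candidate data sets closest to the target.

    Assumption: Euclidean distance between characteristic vectors.
    """
    target_vec = characteristic_vector(target_rows)
    ranked = sorted(
        candidates,
        key=lambda name: dist(characteristic_vector(candidates[name]), target_vec),
    )
    return ranked[:k]


# Toy data (hypothetical): rows are modules, columns are two code metrics.
target = [[10, 1], [12, 2], [11, 1]]
candidates = {
    "proj_a": [[11, 1], [12, 2], [10, 2]],
    "proj_b": [[100, 9], [120, 8], [90, 7]],
    "proj_c": [[14, 2], [15, 3], [13, 2]],
}

selected = select_training_sets(target, candidates, k=2)
print(selected)
```

In this toy setup, the data set whose metric values lie far from the target's (proj_b) is ranked last and left out of the training data. With real metrics on different scales, the characteristic vectors would likely need to be normalized before computing distances.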