On the relative value of cross-company and within-company data for defect prediction

Authors:
Burak Turhan; Tim Menzies; Ayşe B. Bener; Justin Di Stefano
Affiliations:
Department of Computer Engineering, Bogazici University, Istanbul, Turkey; Lane Department of Computer Science and Electrical Engineering, Morgantown, USA; Department of Computer Engineering, Bogazici University, Istanbul, Turkey; Lane Department of Computer Science and Electrical Engineering, Morgantown, USA
Venue:
Empirical Software Engineering
Year:
2009

Abstract

We propose a practical defect prediction approach for companies that do not track defect-related data. Specifically, we investigate the applicability of cross-company (CC) data for building localized defect predictors using static code features. First, we analyze the conditions under which CC data can be used as is; these conditions turn out to be quite few. We then apply principles of analogy-based learning (i.e., nearest neighbor (NN) filtering) to CC data in order to fine-tune these models for localization. We compare the performance of these models with that of defect predictors learned from within-company (WC) data. As expected, defect predictors learned from WC data outperform those learned from CC data. However, predictors learned from NN-filtered CC data achieve performance close to, but still not better than, that of predictors learned from WC data. We therefore perform a final analysis to determine the minimum number of local defect reports needed to learn WC defect predictors. We demonstrate that the number of data samples required to build effective defect predictors can be quite small and can be collected quickly, within a few months. Hence, for companies with no local defect data, we recommend a two-phase approach that lets them start defect prediction immediately. In phase one, companies should use NN-filtered CC data to initiate the defect prediction process and simultaneously start collecting WC (local) data. Once enough WC data is collected (i.e., after a few months), organizations should switch to phase two and use predictors learned from WC data.
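
The NN filtering and the two-phase recommendation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not state the neighborhood size, the distance measure, or the number of local samples that counts as "enough", so the parameter k, the use of Euclidean distance, the threshold min_wc_rows, and all function names and data below are assumptions made for illustration only.

```python
import math


def nn_filter(cc_features, local_features, k):
    """Return sorted indices of CC rows that are among the k nearest
    neighbors (Euclidean distance, an assumption) of at least one local row."""
    selected = set()
    for query in local_features:
        # Rank every CC row by its distance to this local module.
        ranked = sorted(
            range(len(cc_features)),
            key=lambda i: math.dist(cc_features[i], query),
        )
        selected.update(ranked[:k])
    return sorted(selected)


def choose_training_rows(cc_rows, wc_rows, local_features, min_wc_rows, k):
    """Pick training data for the two-phase approach.

    cc_rows and wc_rows are lists of (features, label) pairs.
    min_wc_rows is a hypothetical threshold; the abstract gives no number.
    Returns (phase, rows).
    """
    if len(wc_rows) >= min_wc_rows:
        # Phase two: enough local defect data, so train on WC data only.
        return 2, list(wc_rows)
    # Phase one: train on the CC rows that resemble the local modules.
    cc_features = [features for features, _ in cc_rows]
    keep = nn_filter(cc_features, local_features, k)
    return 1, [cc_rows[i] for i in keep]


# Made-up static code features (two per module) with defect labels.
cc_rows = [
    ((1.0, 1.0), 0),
    ((1.2, 0.9), 1),
    ((9.0, 9.0), 0),
    ((8.5, 9.5), 1),
    ((5.0, 5.0), 0),
]
local_features = [(1.1, 1.0), (0.9, 1.1)]

phase, rows = choose_training_rows(
    cc_rows, wc_rows=[], local_features=local_features, min_wc_rows=3, k=2
)
print(phase, rows)
```

With no WC data collected yet, the sketch stays in phase one and keeps only the two CC rows closest to the local modules. Feature scaling is left out here; distances over raw features are dominated by the features with the largest ranges.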