Improving defect prediction using temporal features and non linear models

Authors:
Abraham Bernstein;Jayalath Ekanayake;Martin Pinzger
Affiliations:
University of Zurich, Switzerland;University of Zurich, Switzerland;University of Zurich, Switzerland
Venue:
Ninth international workshop on Principles of software evolution: in conjunction with the 6th ESEC/FSE joint meeting
Year:
2007

Abstract

Predicting the defects in the next release of a large software system is very valuable to the project manager in planning her resources. In this paper we argue that temporal features (or aspects) of the data are central to prediction performance. We also argue that the use of non-linear models, as opposed to traditional regression, is necessary to uncover some of the hidden interrelationships between the features and the defects and, in some cases, to maintain the accuracy of the prediction. Using data obtained from the CVS and Bugzilla repositories of the Eclipse project, we extract a number of temporal features, such as the number of revisions and the number of reported issues within the last three months. We then use these data to predict both the location of defects (i.e., the classes in which defects will occur) and the number of reported bugs in the next month of the project. To that end, we compare standard tree-based induction algorithms with traditional regression. Our non-linear models uncover the hidden relationships between features and defects and present them in an easy-to-understand form. The results also show that, using the temporal features, our prediction model can predict whether a source file will have a defect with an accuracy of 99% (area under the ROC curve of 0.9251) and the number of defects with a mean absolute error of 0.019 (Spearman's correlation of 0.96).
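
To make the notion of a temporal feature concrete, the following minimal sketch counts revisions and reported issues per file within a trailing window. It is an illustration only, not the authors' extraction pipeline: the function name `count_recent_events`, the 90-day approximation of "three months", the feature names, and the sample file names and dates are all assumptions introduced here.

```python
from collections import Counter
from datetime import date, timedelta


def count_recent_events(events, as_of, window_days=90):
    """Count events per file in the window_days before as_of.

    events: iterable of (file_path, event_date) pairs.
    The window is half-open: start <= event_date < as_of.
    Assumption: 90 days stands in for "three months".
    """
    start = as_of - timedelta(days=window_days)
    counts = Counter()
    for path, when in events:
        if start <= when < as_of:
            counts[path] += 1
    return counts


# Hypothetical sample data, not taken from the Eclipse repositories.
revisions = [
    ("Parser.java", date(2007, 1, 10)),
    ("Parser.java", date(2007, 2, 20)),
    ("Parser.java", date(2006, 9, 1)),  # older than the window, ignored
    ("Lexer.java", date(2007, 3, 1)),
]
issues = [
    ("Parser.java", date(2007, 2, 25)),
]

as_of = date(2007, 3, 15)
rev_counts = count_recent_events(revisions, as_of)
issue_counts = count_recent_events(issues, as_of)

# One feature row per file; a Counter yields 0 for files with no events.
features = {
    path: {"revisions_3m": rev_counts[path], "issues_3m": issue_counts[path]}
    for path in sorted(set(rev_counts) | set(issue_counts))
}

for path, row in features.items():
    print(path, row)
```

Rows of this shape could then be passed to a tree-based learner or a regression model; which learner and which evaluation protocol to use is left open here.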