Sample size vs. bias in defect prediction

  • Authors:
  • Foyzur Rahman; Daryl Posnett; Israel Herraiz; Premkumar Devanbu

  • Affiliations:
  • UC Davis, USA; UC Davis, USA; Universidad Politécnica de Madrid, Spain; UC Davis, USA

  • Venue:
  • Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering
  • Year:
  • 2013

Abstract

Most empirical disciplines promote the reuse and sharing of datasets, as it leads to greater possibility of replication. While this is increasingly the case in Empirical Software Engineering, some of the most popular bug-fix datasets are now known to be biased. This raises two significant concerns: first, that sample bias may lead to underperforming prediction models, and second, that the external validity of studies based on biased datasets may be suspect. This issue has raised considerable consternation in the ESE literature in recent years. However, there is a confounding factor in these datasets that has not been examined carefully: size. Biased datasets sample only some of the data that could be sampled, and do so in a biased fashion; but biased samples can be smaller or larger. Smaller datasets in general provide less reliable bases for estimating models, and thus could lead to inferior model performance. In this setting, we ask: what affects performance more, bias or size? We conduct a detailed, large-scale meta-analysis, using simulated datasets sampled with bias from a high-quality dataset which is relatively free of bias. Our results suggest that size always matters just as much as bias direction, and in fact much more than bias direction when considering information-retrieval measures such as AUCROC and F-score. This indicates that, at least for prediction models, even when dealing with sampling bias, simply finding larger samples can sometimes be sufficient. Our analysis also exposes the complexity of the bias issue, and raises further issues to be explored in the future.
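The simulation design summarized in the abstract can be illustrated with a small sketch: draw biased and unbiased subsamples of varying size from a defect dataset, train a simple defect predictor on each, and compare AUC and F-score on a common held-out set. This is a hypothetical illustration, not the authors' actual pipeline; the metric columns, the size-based bias rule, and the logistic-regression classifier below are all illustrative assumptions.

```python
# Hypothetical sketch of the size-vs-bias comparison described in the abstract.
# Column names, the bias mechanism, and the classifier are illustrative
# assumptions, not the paper's actual experimental setup.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, f1_score

rng = np.random.default_rng(0)

# Toy "ground truth" dataset: two file-level metrics plus a defect label.
n = 5000
data = pd.DataFrame({
    "loc": rng.lognormal(5, 1, n),     # lines of code (synthetic)
    "churn": rng.lognormal(3, 1, n),   # recent change churn (synthetic)
})
logit = 0.002 * data["loc"] + 0.01 * data["churn"] - 2.5
data["defective"] = rng.random(n) < 1 / (1 + np.exp(-logit))

def evaluate(train, test):
    """Fit a simple defect predictor and report AUC and F-score on held-out data."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train[["loc", "churn"]], train["defective"])
    probs = clf.predict_proba(test[["loc", "churn"]])[:, 1]
    return roc_auc_score(test["defective"], probs), f1_score(test["defective"], probs > 0.5)

test = data.sample(frac=0.3, random_state=1)   # shared held-out evaluation set
pool = data.drop(test.index)                   # pool from which training samples are drawn

for size in (100, 500, 2000):
    # Unbiased sample of the given size.
    unbiased = pool.sample(size, random_state=2)
    # Biased sample: over-represent large files (one possible bias mechanism).
    weights = pool["loc"] / pool["loc"].sum()
    biased = pool.sample(size, weights=weights, random_state=2)
    print(size, "unbiased:", evaluate(unbiased, test), "biased:", evaluate(biased, test))
```

Repeating such draws many times and comparing performance across sample sizes and bias settings is the kind of meta-analysis the abstract alludes to; the question is whether the gap between biased and unbiased samples, or the gap between small and large samples, dominates the metrics.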