Privacy and utility for defect prediction: experiments with MORPH

Authors:
Fayola Peters;Tim Menzies
Affiliations:
West Virginia University, USA;West Virginia University, USA
Venue:
Proceedings of the 34th International Conference on Software Engineering
Year:
2012


Abstract

Ideally, we can learn lessons from software projects across multiple organizations. However, a major impediment to such knowledge sharing is the privacy concerns of software development organizations. This paper aims to provide defect data-set owners with an effective means of privatizing their data prior to release. We explore MORPH, a data mutator designed to preserve the class boundaries in a data-set: it moves the data a random distance, taking care not to cross class boundaries. The value of training on this MORPHed data is tested via a 10-way within learning study and a cross learning study using Random Forests, Naive Bayes, and Logistic Regression on ten object-oriented defect data-sets from the PROMISE data repository. Measured in terms of exposure of sensitive attributes, the MORPHed data was four times more private than the unMORPHed data. In terms of f-measures, there was little difference between the MORPHed and unMORPHed data (original data and data privatized by data-swapping) in both the cross and within studies. We conclude that, at least for the kinds of OO defect data studied in this project, data can be privatized without concerns for inference efficacy.
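
To make the mutation idea concrete, the following is a minimal Python sketch of a MORPH-style mutator. It is not the authors' implementation. The abstract says only that data is moved "a random distance" without crossing class boundaries, so several details here are assumptions: each row is shifted toward or away from its nearest row of a different class (its "nearest unlike neighbor"), the shift is a random fraction between 0.15 and 0.35 of that distance, and distance is Euclidean. The names `morph_like` and `nearest_unlike_neighbor` are hypothetical.

```python
import math
import random


def nearest_unlike_neighbor(row, label, data, labels):
    """Return the closest row whose class label differs from `label`, or None."""
    best, best_dist = None, math.inf
    for other, other_label in zip(data, labels):
        if other_label == label:
            continue
        dist = math.dist(row, other)  # Euclidean distance (assumption)
        if dist < best_dist:
            best, best_dist = other, dist
    return best


def morph_like(data, labels, lo=0.15, hi=0.35, seed=None):
    """Shift every row by a random fraction of the gap to its nearest unlike neighbor.

    The fraction bounds `lo` and `hi` are assumed values. Keeping them below 0.5
    means a row never travels as far as half the distance to the nearest row of
    another class, which is one simple way to approximate "do not cross class
    boundaries". Class labels are left untouched.
    """
    rng = random.Random(seed)
    mutated = []
    for row, label in zip(data, labels):
        z = nearest_unlike_neighbor(row, label, data, labels)
        if z is None:
            # Only one class present: there is no boundary to measure against.
            mutated.append(list(row))
            continue
        r = rng.uniform(lo, hi)            # random step size
        sign = rng.choice((-1.0, 1.0))     # move toward or away from z
        mutated.append([x + sign * r * (x - zi) for x, zi in zip(row, z)])
    return mutated


if __name__ == "__main__":
    # Toy two-attribute data with two classes (illustrative values only).
    rows = [[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]]
    classes = [0, 0, 1, 1]
    for before, after in zip(rows, morph_like(rows, classes, seed=1)):
        print(before, "->", [round(v, 3) for v in after])
```

A learner would then be trained on the mutated rows with the original labels and its f-measure compared against one trained on the unmutated rows, which is the kind of comparison the abstract reports.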