A tree-based approach to preserve the privacy of software engineering data and predictive models

Authors:
Yu Fu;A. Güneş Koru;Zhiyuan Chen;Khaled El Emam
Affiliations:
UMBC, Baltimore, MD;UMBC, Baltimore, MD;UMBC, Baltimore, MD;University of Ottawa, Ottawa, CA
Venue:
PROMISE '09 Proceedings of the 5th International Conference on Predictor Models in Software Engineering
Year:
2009

Abstract

In empirical disciplines, data sharing makes research verifiable and facilitates future studies. Recent efforts of the PROMISE community have contributed to data sharing and reproducible research in software engineering. However, a substantial portion of the data used in empirical software engineering research remains classified. This situation is unlikely to change because many companies, governments, and defense organizations will always be hesitant to share their project data, such as effort and defect data, due to confidentiality, privacy, and security concerns. In this paper, we present, demonstrate, and evaluate a novel tree-based data perturbation approach. This approach not only preserves privacy effectively but also preserves the predictive patterns in the original data set. Consequently, empirical software engineering researchers will have access to another category of data sets, transformed data sets, which will increase the verifiability of research results and facilitate future research in this area. Our approach can be immediately useful to the many researchers and organizations who are willing to share their software engineering data but cannot do so because of privacy concerns.