A tree-based approach to preserve the privacy of software engineering data and predictive models

  • Authors:
  • Yu Fu;A. Güneş Koru;Zhiyuan Chen;Khaled El Emam

  • Affiliations:
  • UMBC, Baltimore, MD;UMBC, Baltimore, MD;UMBC, Baltimore, MD;University of Ottawa, Ottawa, CA

  • Venue:
  • PROMISE '09 Proceedings of the 5th International Conference on Predictor Models in Software Engineering
  • Year:
  • 2009

Abstract

In empirical disciplines, data sharing leads to verifiable research and facilitates future research studies. Recent efforts of the PROMISE community have contributed to data sharing and reproducible research in software engineering. However, a substantial portion of the data used in empirical software engineering research remains classified. This situation is unlikely to change because many companies, governments, and defense organizations will always be hesitant to share their project data, such as effort and defect data, due to various confidentiality, privacy, and security concerns. In this paper, we present, demonstrate, and evaluate a novel tree-based data perturbation approach. This approach not only preserves privacy effectively but also preserves the predictive patterns in the original data set. Consequently, empirical software engineering researchers will have access to another category of data sets, transformed data sets, which will increase the verifiability of research results and facilitate future research studies in this area. Our approach can be immediately useful to the many researchers and organizations who are willing to share their software engineering data but cannot do so due to privacy concerns.