Uniqueness and how it impacts privacy in health-related social science datasets

Authors:
A. Cheyenne Solomon; Raquel Hill; Erick Janssen; Stephanie A. Sanders; Julia R. Heiman
Affiliations:
Indiana University, Bloomington, IN, USA; Indiana University, Bloomington, IN, USA; Indiana University, Bloomington, IN, USA; Indiana University, Bloomington, IN, USA; Indiana University, Bloomington, IN, USA
Venue:
Proceedings of the 2nd ACM SIGHIT International Health Informatics Symposium
Year:
2012

Abstract

Social scientists, such as those performing research at the Kinsey Institute for Research in Sex, Gender and Reproduction, may use surveys to gather large amounts of sensitive data. Unlike purely medical datasets, these social science datasets tend to be sparse and high-dimensional, which presents opportunities to characterize participants in the dataset in unique ways. These unique characterizations may enable individuals to be linked to external data in ways that have not previously been considered. Therefore, traditional approaches to de-identifying data, such as fulfilling HIPAA requirements, may not be sufficient to prevent the re-identification of participants in large social science datasets. In this paper, we evaluate the statistical characteristics of two high-dimensional social science datasets to better understand how unique features affect privacy. We apply a class of statistical de-anonymization attacks in an attempt to achieve theoretical re-identification of participants. We assume that an attacker has exact knowledge of a subset of attribute values for a particular record and wants to link this subset to the actual record in order to discover the record's remaining content. We show that although 98% of the records within the dataset are unique given any three attributes, re-identification of the records may not be easily achieved. We attribute the limited re-identification to the inherent similarity in the human behavior that the scientists measure. This work is the first to characterize re-identification risks in high-dimensional data collected in surveys designed to capture the various behaviors and experiences of groups of individuals.
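
To make the uniqueness measure and the attacker model in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes each record is a dictionary of categorical answers, reads "unique given any three attributes" as the fraction of records whose values on a chosen attribute triple are shared by no other record (averaged over all triples), and models the attacker with exact matching only, which is simpler than the statistical attacks the paper applies. The function names (`unique_fraction`, `mean_unique_fraction`, `candidates`) and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def unique_fraction(records, attrs):
    """Fraction of records whose values on `attrs` match no other record."""
    keys = [tuple(r[a] for a in attrs) for r in records]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(records)


def mean_unique_fraction(records, k=3):
    """Average of unique_fraction over every combination of k attributes."""
    names = sorted(records[0])
    combos = list(combinations(names, k))
    return sum(unique_fraction(records, c) for c in combos) / len(combos)


def candidates(records, known):
    """Indices of records consistent with the attacker's exact partial knowledge.

    `known` maps attribute name -> value the attacker knows exactly.
    """
    return [
        i for i, r in enumerate(records)
        if all(r[a] == v for a, v in known.items())
    ]


# Toy, made-up survey answers (illustrative only, not from the paper's datasets).
toy = [
    {"q1": 1, "q2": 0, "q3": 2, "q4": 1},
    {"q1": 1, "q2": 0, "q3": 2, "q4": 0},
    {"q1": 0, "q2": 1, "q3": 2, "q4": 1},
    {"q1": 1, "q2": 1, "q3": 0, "q4": 1},
]

frac_q123 = unique_fraction(toy, ("q1", "q2", "q3"))
avg_triples = mean_unique_fraction(toy, k=3)
two_known = candidates(toy, {"q1": 1, "q2": 0})
three_known = candidates(toy, {"q1": 1, "q2": 0, "q4": 0})
```

In this sketch, a candidate list of length one means the attacker's partial knowledge singles out one record within the dataset. High uniqueness inside the dataset does not by itself give the attacker a link to a named person, which is consistent with the abstract's point that uniqueness and practical re-identification can diverge.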