The effects of location access behavior on re-identification risk in a distributed environment

Authors:
Bradley Malin;Edoardo Airoldi
Affiliations:
Department of Biomedical Informatics, Vanderbilt University, Nashville, TN;School of Computer Science, Carnegie Mellon University, Pittsburgh, PA
Venue:
PET'06 Proceedings of the 6th international conference on Privacy Enhancing Technologies
Year:
2006

Abstract

In this paper, we investigate how location access patterns influence the re-identification of seemingly anonymous data. In the real world, individuals visit different locations that gather similar information. For instance, multiple hospitals collect health information on the same patient. For research purposes, hospitals share sensitive data, such as DNA sequences, stripped of explicit identifiers to protect anonymity. Separately, for administrative functions, they make identified data, stripped of DNA, available. On a hospital-by-hospital basis, each pair of DNA and identified databases appears unlinkable; however, links can be established when the databases of multiple locations are studied together. This problem, known as trail re-identification, is a general phenomenon that occurs because an individual's location access pattern can be matched across the shared databases. Data holders cannot exchange data to find and suppress the trails that would be re-identified. Thus, it is important to assess the re-identification risk in a system in order to develop techniques to mitigate it. In this research, we evaluate several real-world datasets and observe that trail re-identification is related to the number of people relative to the number of places. To study this phenomenon in more detail, we develop a generative model for location access patterns that simulates observed behavior. We evaluate trail re-identification risk across a range of simulated patterns, and our findings suggest that the skew of the distribution of people to places is one of the main factors driving trail re-identification.
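
The matching idea behind trail re-identification can be illustrated with a minimal sketch. This is not the authors' algorithm. It is a toy version that assumes every location releases both of its databases in full, so a person's identified trail and DNA trail cover exactly the same set of locations. The hospital labels, names, DNA record IDs, and the function names `build_trails` and `link_unique_trails` are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def build_trails(releases):
    """Map each record key to the frozenset of locations that released it."""
    trails = defaultdict(set)
    for location, keys in releases.items():
        for key in keys:
            trails[key].add(location)
    return {key: frozenset(locs) for key, locs in trails.items()}


def link_unique_trails(identified, deidentified):
    """Link a DNA record to a name when both have the same trail and
    no other record in either collection shares that trail.

    Assumption (illustrative only): trails are complete, i.e. each
    location releases a name and a DNA record for every visitor.
    """
    names_by_trail = defaultdict(list)
    for name, trail in build_trails(identified).items():
        names_by_trail[trail].append(name)

    dna_by_trail = defaultdict(list)
    for dna, trail in build_trails(deidentified).items():
        dna_by_trail[trail].append(dna)

    links = {}
    for trail, dnas in dna_by_trail.items():
        names = names_by_trail.get(trail, [])
        if len(dnas) == 1 and len(names) == 1:
            links[dnas[0]] = names[0]
    return links


# Hypothetical releases: per hospital, a set of names and a set of DNA IDs.
identified = {
    "H1": {"Alice", "Bob", "Carol"},
    "H2": {"Alice", "Carol", "Dave"},
    "H3": {"Bob", "Dave"},
}
deidentified = {
    "H1": {"a1", "b2", "c3"},
    "H2": {"a1", "c3", "d4"},
    "H3": {"b2", "d4"},
}

links = link_unique_trails(identified, deidentified)
print(sorted(links.items()))
```

In this toy data no single hospital's pair of databases can be linked, yet the records `b2` and `d4` are linked to Bob and Dave because each has a trail that no one else shares. Alice and Carol visit the same two hospitals, so their records stay ambiguous. This is consistent with the abstract's observation that how people are distributed across places affects risk.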