An integrated framework for de-identifying unstructured medical data

Authors:
James Gardner; Li Xiong
Affiliations:
Department of Mathematics and Computer Science, Emory University; Department of Mathematics and Computer Science, Emory University
Venue:
Data & Knowledge Engineering
Year:
2009


Abstract

While there is an increasing need to share medical information for public health research, such data sharing must preserve patient privacy and must not disclose any information that can be used to identify a patient. A considerable amount of research in the data privacy community has been devoted to formalizing the notion of identifiability and developing techniques for anonymization, but this work has focused exclusively on structured data. On the other hand, efforts to de-identify medical text documents in the medical informatics community rely on simple identifier removal or grouping techniques and do not take advantage of the research developments in the data privacy community. This paper attempts to fill the above gaps and presents a framework and prototype system for de-identifying health information, including both structured and unstructured data. We empirically study a simple Bayesian classifier, a Bayesian classifier with a sampling-based technique, and a conditional random field-based classifier for extracting identifying attributes from unstructured data. We deploy a k-anonymization-based technique for de-identifying the extracted data while preserving maximum data utility. We present a set of preliminary evaluations showing the effectiveness of our approach.
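
As a rough illustration of the k-anonymization step mentioned in the abstract, the following minimal sketch checks whether a set of extracted records is k-anonymous and generalizes one attribute until it is. It is not the paper's system: the record fields (age, zip, diagnosis), the helper names, and the fixed-width age ranges are all assumptions chosen for illustration.

```python
from collections import Counter


def generalize_age(age, width):
    """Map an exact age to a range of the given width, e.g. 37 -> '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def generalize(records, width):
    """Return copies of the records with the age attribute coarsened."""
    return [{**r, "age": generalize_age(r["age"], width)} for r in records]


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs at least k times."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(c >= k for c in counts.values())


# Hypothetical attributes, as if already extracted from clinical text.
records = [
    {"age": 34, "zip": "30322", "diagnosis": "asthma"},
    {"age": 37, "zip": "30322", "diagnosis": "diabetes"},
    {"age": 52, "zip": "30329", "diagnosis": "asthma"},
    {"age": 58, "zip": "30329", "diagnosis": "flu"},
]

quasi_identifiers = ["age", "zip"]
raw_ok = is_k_anonymous(records, quasi_identifiers, k=2)
generalized = generalize(records, width=10)
generalized_ok = is_k_anonymous(generalized, quasi_identifiers, k=2)
```

With exact ages, every record is unique on (age, zip), so the raw data is not 2-anonymous. After coarsening ages into ten-year ranges, each (age, zip) combination covers two records, while the diagnosis column is left untouched to retain utility.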