A de-identifier for medical discharge summaries

Authors:
Özlem Uzuner;Tawanda C. Sibanda;Yuan Luo;Peter Szolovits
Affiliations:
University at Albany, State University of New York, Draper 114, 135 Western Avenue, Albany, NY 12222, United States;Massachusetts Institute of Technology, Computer Science and Artificial Intelligence Laboratory, 32 Vassar Street, Cambridge, MA 02139, United States;University at Albany, State University of New York, Draper 114, 135 Western Avenue, Albany, NY 12222, United States;Massachusetts Institute of Technology, Computer Science and Artificial Intelligence Laboratory, 32 Vassar Street, Cambridge, MA 02139, United States
Venue:
Artificial Intelligence in Medicine
Year:
2008

Abstract

Objective: Clinical records contain significant medical information that can be useful to researchers in various disciplines. However, these records also contain personal health information (PHI) whose presence limits the use of the records outside of hospitals. The goal of de-identification is to remove all PHI from clinical records. This is a challenging task because many records contain foreign and misspelled PHI; they also contain PHI that are ambiguous with non-PHI. These complications are compounded by the linguistic characteristics of clinical records. For example, medical discharge summaries, which are studied in this paper, are characterized by fragmented, incomplete utterances and domain-specific language; they cannot be fully processed by tools designed for lay language.

Methods and results: In this paper, we show that we can de-identify medical discharge summaries using a de-identifier, Stat De-id, based on support vector machines and local context (F-measure=97% on PHI). Our representation of local context aids de-identification even when PHI include out-of-vocabulary words and even when PHI are ambiguous with non-PHI within the same corpus. Comparison of Stat De-id with a rule-based approach shows that local context contributes more to de-identification than dictionaries combined with hand-tailored heuristics (F-measure=85%). Comparison with two well-known named entity recognition (NER) systems, SNoW (F-measure=94%) and IdentiFinder (F-measure=36%), on five representative corpora shows that when the language of documents is fragmented, a system with a relatively thorough representation of local context can be a more effective de-identifier than systems that combine relatively simpler local context with global context. Comparison with a Conditional Random Field De-identifier (CRFD), which utilizes global context in addition to the local context of Stat De-id, confirms this finding (F-measure=88%) and establishes that strengthening the representation of local context may be more beneficial for de-identification than complementing local with global context.
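
To make the notion of "local context" concrete, the following is a minimal sketch, assuming a simple fixed-size window of surface features around each token. The function name `local_context_features`, the feature names, the `<PAD>` marker, and the window size are all hypothetical choices for illustration; they are not the feature set used by Stat De-id, which the abstract does not specify. Each resulting feature dictionary could then be vectorized and passed to a per-token classifier such as a support vector machine.

```python
from typing import Dict, List


def local_context_features(tokens: List[str], i: int, window: int = 2) -> Dict[str, object]:
    """Describe tokens[i] using only the token itself and its near neighbours.

    Illustrative only: the features here are assumptions, not the
    representation used in the paper.
    """
    tok = tokens[i]
    feats: Dict[str, object] = {
        "word": tok.lower(),
        "is_capitalized": tok[:1].isupper(),
        "has_digit": any(c.isdigit() for c in tok),
        "length": len(tok),
    }
    # Add the neighbouring words on each side; pad at sentence boundaries.
    for off in range(1, window + 1):
        left, right = i - off, i + off
        feats[f"prev_{off}"] = tokens[left].lower() if left >= 0 else "<PAD>"
        feats[f"next_{off}"] = tokens[right].lower() if right < len(tokens) else "<PAD>"
    return feats


# A fragmented, discharge-summary-style utterance (invented example text).
tokens = "Pt seen by Dr. Smith on 03/14".split()
features = [local_context_features(tokens, i) for i in range(len(tokens))]
```

In this sketch the token "Smith" is described by its neighbours "dr." and "on" rather than by a dictionary lookup, which is why a context-based representation can still flag a misspelled or out-of-vocabulary name.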