Role of local context in automatic deidentification of ungrammatical, fragmented text

Authors:
Tawanda Sibanda;Ozlem Uzuner
Affiliations:
CSAIL, Massachusetts Institute of Technology, Cambridge, MA;University at Albany, SUNY, Albany, NY
Venue:
HLT-NAACL '06 Proceedings of the main conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics
Year:
2006

Abstract

Deidentification of clinical records is a crucial step before these records can be distributed to non-hospital researchers. Most approaches to deidentification rely heavily on dictionaries and heuristic rules; these approaches fail to remove most personal health information (PHI) that cannot be found in dictionaries. They can also fail to remove PHI that is ambiguous between PHI and non-PHI.

Named entity recognition (NER) technologies can be used for deidentification. Some of these technologies exploit both local and global context of a word to identify its entity type. When documents are grammatically written, global context can improve NER.

In this paper, we show that we can deidentify medical discharge summaries using support vector machines that rely on a statistical representation of local context. We compare our approach with three different systems. Comparison with a rule-based approach shows that a statistical representation of local context contributes more to deidentification than dictionaries and hand-tailored heuristics. Comparison with two well-known systems, SNoW and IdentiFinder, shows that when the language of documents is fragmented, local context contributes more to deidentification than global context.
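
To make the idea of "local context" concrete, the following is a minimal sketch of how a token could be described using only itself and its immediate neighbors. It is an illustration under stated assumptions, not the feature set used in the paper: the function name local_context_features, the window size of two, the padding symbol, and the particular features (lowercased word, capitalization, presence of a digit) are all hypothetical choices made for this example.

```python
from typing import Dict, List

PAD = "<PAD>"  # hypothetical placeholder for positions outside the text


def local_context_features(tokens: List[str], i: int, window: int = 2) -> Dict[str, str]:
    """Describe token i using only itself and neighbors within `window` positions.

    Illustrative assumption: the paper's actual features are not listed in the
    abstract, so these are generic stand-ins for a local-context representation.
    """
    target = tokens[i]
    features = {
        "word": target.lower(),
        "is_capitalized": str(target[:1].isupper()),
        "has_digit": str(any(ch.isdigit() for ch in target)),
    }
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        # Neighbors beyond either end of the text are replaced by the pad symbol.
        neighbor = tokens[j].lower() if 0 <= j < len(tokens) else PAD
        features[f"word[{offset:+d}]"] = neighbor
    return features


# A fragmented, ungrammatical clinical-style note (invented for this example).
tokens = "pt seen by dr smith on 3/14".split()
features = local_context_features(tokens, 4)  # features for the token "smith"
for name, value in sorted(features.items()):
    print(name, value)
```

In this invented example the token "smith" is lowercase and would not be caught by a capitalization rule, but its left neighbor "dr" is still visible in the feature dictionary. Dictionaries like this one would typically be converted to numeric vectors before being passed to a classifier such as a support vector machine; that step is omitted here.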