Automatic Detecting Documents Containing Personal Health Information

Authors:
Yunli Wang;Hongyu Liu;Liqiang Geng;Matthew S. Keays;Yonghua You
Affiliations:
Institute for Information Technology, National Research Council Canada, Canada;Institute for Information Technology, National Research Council Canada, Canada;Institute for Information Technology, National Research Council Canada, Canada;Institute for Information Technology, National Research Council Canada, Canada;Institute for Information Technology, National Research Council Canada, Canada
Venue:
AIME '09 Proceedings of the 12th Conference on Artificial Intelligence in Medicine: Artificial Intelligence in Medicine
Year:
2009

Abstract

With the increasing use of computers and the Internet, personal health information (PHI) is distributed across multiple institutions, often scattered across multiple devices and stored in diverse formats. Non-traditional medical records such as emails and electronic documents containing PHI are at high risk of privacy leakage. We face the challenge of locating and managing PHI in this distributed environment. The goal of this study is to classify electronic documents as PHI or non-PHI. A supervised machine learning method was used for this text categorization task. Three classifiers (SVM, decision tree, and naive Bayes) were tested on three data sets. Lexical, semantic, and syntactic features and their combinations were compared in terms of their effectiveness in classifying PHI documents. The results show that combining semantic and/or syntactic features with lexical features is more effective than using lexical features alone for PHI classification. Overall, the supervised machine learning method is effective in classifying documents as PHI or non-PHI.
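
To make the task setup concrete, the following is a minimal sketch of binary PHI / non-PHI text categorization with lexical features alone and with lexical plus semantic features. It is not the authors' implementation. The toy lexicon, the four training sentences, and names such as `SEMANTIC_LEXICON` and `combined_features` are all hypothetical. Only a small multinomial naive Bayes classifier is shown; the SVM and decision tree classifiers and the syntactic features mentioned in the abstract are omitted.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical toy lexicon mapping words to semantic categories.
# This is an assumption for illustration, not the paper's feature set.
SEMANTIC_LEXICON = {
    "diabetes": "DISEASE", "asthma": "DISEASE",
    "insulin": "DRUG", "ibuprofen": "DRUG",
    "patient": "PERSON_ROLE", "dr": "PERSON_ROLE",
}

def lexical_features(text):
    """Bag-of-words features: lowercased alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def combined_features(text):
    """Lexical tokens plus one semantic-category token per lexicon hit."""
    tokens = lexical_features(text)
    semantic = ["SEM_" + SEMANTIC_LEXICON[t] for t in tokens if t in SEMANTIC_LEXICON]
    return tokens + semantic

class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels, featurize):
        self.featurize = featurize
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.token_counts[label].update(featurize(doc))
        self.vocab = {tok for counts in self.token_counts.values() for tok in counts}
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for tok in self.featurize(doc):
                if tok in self.vocab:  # tokens unseen in training are skipped
                    score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Made-up training examples (not from the paper's data sets).
TRAIN = [
    ("Patient John Smith was diagnosed with diabetes and prescribed insulin.", "PHI"),
    ("Dr Lee reviewed the patient chart and adjusted the asthma medication.", "PHI"),
    ("The quarterly budget meeting is moved to Friday afternoon.", "non-PHI"),
    ("Please review the attached project schedule before the release.", "non-PHI"),
]
docs, labels = zip(*TRAIN)

lexical_model = NaiveBayes().fit(docs, labels, lexical_features)
combined_model = NaiveBayes().fit(docs, labels, combined_features)

for text in ["The patient takes ibuprofen daily.", "The project meeting is on Friday."]:
    print(text, "|", lexical_model.predict(text), "|", combined_model.predict(text))
```

In the combined model, a drug name that never appeared in the training text (here "ibuprofen") still contributes evidence through its `SEM_DRUG` category token. That is one plausible reading of why adding semantic features could help over lexical features alone. A toy example like this cannot confirm the paper's reported results.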