t-Plausibility: Generalizing Words to Desensitize Text

Authors:
Balamurugan Anandan; Chris Clifton; Wei Jiang; Mummoorthy Murugesan; Pedro Pastrana-Camacho; Luo Si
Affiliations:
Department of Computer Science, Purdue University, 305 N University St, West Lafayette, IN 47907-2107, USA. e-mail: banandan@purdue.edu
Department of Computer Science, Purdue University, 305 N University St, West Lafayette, IN 47907-2107, USA. e-mail: clifton@purdue.edu
Department of Computer Science, Missouri University of Science and Technology, 310 Computer Science Building, 500 W 15th St, Rolla, MO 65409-0350, USA. e-mail: wjiang@mst.edu
Teradata, 100 N Sepulveda Blvd, El Segundo, CA 92045, USA. e-mail: Mummoorthy.Murugesan@teradata.com
Department of Computer Science, Purdue University, 305 N University St, West Lafayette, IN 47907-2107, USA. e-mail: ppastran@purdue.edu
Department of Computer Science, Purdue University, 305 N University St, West Lafayette, IN 47907-2107, USA. e-mail: lsi@purdue.edu
Venue:
Transactions on Data Privacy
Year:
2012

Abstract

De-identified data has the potential to be shared widely to support decision making and research. While significant advances have been made in the anonymization of structured data, anonymization of textual information is in its infancy. Document sanitization requires finding and removing personally identifiable information. While current tools are effective at removing specific types of information (names, addresses, dates), they fail on two counts. First, complete redaction of the text may not be necessary to prevent re-identification, and it can hurt the readability and usability of the text. Second, and more serious, identifying and sensitive information can be quite subtle and may remain in the text even after obvious identifiers are removed. For example, the diagnosis "tuberculosis" is sensitive, but in some situations it can also be identifying; replacing it with the less sensitive term "infectious disease" also reduces identifiability. That is, instead of simply removing sensitive terms, these terms can be replaced by more general but semantically related terms, protecting sensitive and identifying information without unnecessarily reducing the amount of information contained in the document. Based on this observation, the main contribution of this paper is a novel information-theoretic approach to text sanitization, together with efficient heuristics for sanitizing text documents.
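
The generalization idea in the abstract (replacing "tuberculosis" with "infectious disease") can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the hand-written `HYPERNYMS` hierarchy, the function names `generalize` and `sanitize`, and the `levels` parameter are hypothetical, and the sketch does not implement the paper's information-theoretic measure or its heuristics.

```python
import re

# Hypothetical toy hierarchy (an assumption, not from the paper):
# each term maps to one more general, semantically related term.
HYPERNYMS = {
    "tuberculosis": "infectious disease",
    "infectious disease": "disease",
    "disease": "condition",
}


def generalize(term, levels=1):
    """Walk up to `levels` steps in the toy hierarchy, stopping at the top."""
    for _ in range(levels):
        parent = HYPERNYMS.get(term)
        if parent is None:
            break
        term = parent
    return term


def sanitize(text, sensitive_terms, levels=1):
    """Replace each sensitive term in `text` with a more general term.

    A single regex pass is used so that a replacement is never
    generalized a second time by a later term in the list.
    """
    if not sensitive_terms:
        return text
    # Longest terms first so multi-word terms take priority.
    terms = sorted(sensitive_terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: generalize(m.group(1).lower(), levels), text)


if __name__ == "__main__":
    note = "The patient was treated for tuberculosis."
    print(sanitize(note, ["tuberculosis"]))
    print(sanitize(note, ["tuberculosis"], levels=2))
```

In this sketch, a larger `levels` value gives a more general and therefore less informative replacement, which mirrors the trade-off the abstract describes between protecting sensitive terms and preserving the information in the document.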