Detecting sensitive information from textual documents: an information-theoretic approach

Authors:
David Sánchez;Montserrat Batet;Alexandre Viejo
Affiliations:
Departament d'Enginyeria Informàtica i Matemàtiques, UNESCO Chair in Data Privacy, Universitat Rovira i Virgili, Tarragona, Spain;Departament d'Enginyeria Informàtica i Matemàtiques, UNESCO Chair in Data Privacy, Universitat Rovira i Virgili, Tarragona, Spain;Departament d'Enginyeria Informàtica i Matemàtiques, UNESCO Chair in Data Privacy, Universitat Rovira i Virgili, Tarragona, Spain
Venue:
MDAI'12 Proceedings of the 9th international conference on Modeling Decisions for Artificial Intelligence
Year:
2012

Abstract

Whenever a document containing sensitive information needs to be made public, privacy-preserving measures should be implemented. Document sanitization aims at detecting sensitive pieces of information in text, which are then removed or hidden prior to publication. Even though methods have been developed to detect sensitive structured information, such as e-mail addresses, dates or social security numbers, and domain-specific data, such as disease names, the sanitization of raw textual data has scarcely been addressed. In this paper, we present a general-purpose method to automatically detect sensitive information in textual documents in a domain-independent way. Relying on information theory and a corpus as large as the Web, it assesses the degree of sensitiveness of terms according to the amount of information they provide. Preliminary results show that our method significantly improves detection recall in comparison with approaches based on trained classifiers.
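
The abstract does not state the formula used, so the following is only a minimal sketch of the general idea, assuming the standard information-theoretic definition IC(t) = -log2 p(t), where p(t) is approximated by the fraction of corpus pages that contain the term t. The function names, the hit counts, the corpus size and the threshold are all hypothetical illustration values, not figures or procedures taken from the paper.

```python
import math


def information_content(hits: int, total_pages: int) -> float:
    """Return IC(t) = -log2(hits / total_pages), in bits.

    Assumption: the term's probability is approximated by its page count
    divided by the corpus size. A term with no hits is treated as
    maximally informative.
    """
    if hits <= 0:
        return float("inf")
    return -math.log2(hits / total_pages)


def detect_sensitive(terms, hit_counts, total_pages, threshold_bits):
    """Return the terms whose information content reaches the threshold."""
    return [
        term
        for term in terms
        if information_content(hit_counts.get(term, 0), total_pages) >= threshold_bits
    ]


# Hypothetical page counts; in practice they would come from a Web search engine.
TOTAL_PAGES = 2 ** 30
HIT_COUNTS = {"patient": 2 ** 24, "hospital": 2 ** 25, "meningitis": 2 ** 14}

flagged = detect_sensitive(
    ["patient", "hospital", "meningitis"], HIT_COUNTS, TOTAL_PAGES, threshold_bits=10
)
print(flagged)  # ['meningitis']
```

With these made-up counts, the general terms carry 6 and 5 bits and stay below the threshold, while the rarer and more specific term carries 16 bits and is flagged. The threshold is a free parameter in this sketch: a lower value flags more terms, which raises recall at the cost of removing more text.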