Whenever a document containing sensitive information needs to be made public, privacy-preserving measures should be implemented. Document sanitization aims to detect sensitive pieces of information in text, which are removed or hidden prior to publication. Even though methods have been developed to detect sensitive structured information (e.g., e-mail addresses, dates, or social security numbers) or domain-specific data (e.g., disease names), the sanitization of raw textual data has scarcely been addressed. In this paper, we present a general-purpose method to automatically detect sensitive information in textual documents in a domain-independent way. Relying on information theory and a corpus as large as the Web, it assesses the degree of sensitiveness of terms according to the amount of information they provide. Preliminary results show that our method significantly improves detection recall compared with approaches based on trained classifiers.
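The core idea, assessing a term's sensitiveness by the amount of information it provides, can be sketched with the classic information-content measure IC(t) = -log2 p(t), where p(t) is the term's probability of occurrence estimated from a large corpus such as the Web. The sketch below is a minimal illustration of that principle, not the authors' exact procedure: the hit counts, corpus size, and threshold are all hypothetical stand-ins for real Web query results.

```python
import math

def information_content(term_hits, total_docs):
    """IC(t) = -log2 p(t), with p(t) estimated from corpus hit counts."""
    p = term_hits / total_docs
    return -math.log2(p)

def sensitive_terms(hit_counts, total_docs, threshold):
    """Flag terms whose information content exceeds the threshold:
    rare (highly informative) terms are the candidate sensitive ones."""
    return {t for t, hits in hit_counts.items()
            if information_content(hits, total_docs) > threshold}

# Toy hit counts standing in for Web query results (illustrative numbers).
hits = {"researcher": 120_000_000, "AIDS": 40_000_000, "Walter_Bond": 9_000}
total = 1_000_000_000   # assumed corpus (Web) size in documents
flagged = sensitive_terms(hits, total, threshold=10.0)
# The rare proper name is flagged; common terms carry too little information.
```

Under this toy estimate, common words like "researcher" yield an IC of only a few bits, whereas a rare proper name exceeds the threshold and is flagged for removal, which mirrors the intuition that highly informative terms are the ones likely to reveal sensitive content.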