Obfuscating document stylometry to preserve author anonymity

Authors: Gary Kacmarcik; Michael Gamon
Affiliations: Microsoft Research, Redmond, WA; Microsoft Research, Redmond, WA
Venue: COLING-ACL '06 Proceedings of the COLING/ACL on Main conference poster sessions
Year: 2006

Citing (5)

Fast training of support vector machines using sequential minimal optimization
Advances in kernel methods

Text Categorization with Support Vector Machines: Learning with Many Relevant Features
ECML '98 Proceedings of the 10th European Conference on Machine Learning

Authorship verification as a one-class classification problem
ICML '04 Proceedings of the twenty-first international conference on Machine learning

Can pseudonymity really guarantee privacy?
SSYM'00 Proceedings of the 9th conference on USENIX Security Symposium - Volume 9

A Bayesian approach to learning Bayesian networks with local structure
UAI'97 Proceedings of the Thirteenth conference on Uncertainty in artificial intelligence

Cited by (7)

Authorship attribution
Foundations and Trends in Information Retrieval

Stylometric Identification in Electronic Markets: Scalability and Robustness
Journal of Management Information Systems

Mixed-initiative security agents
Proceedings of the 2nd ACM workshop on Security and artificial intelligence

Intrinsic plagiarism analysis
Language Resources and Evaluation

Use fewer instances of the letter "i": toward writing style anonymization
PETS'12 Proceedings of the 12th international conference on Privacy Enhancing Technologies

Adversarial stylometry: Circumventing authorship recognition to preserve privacy and anonymity
ACM Transactions on Information and System Security (TISSEC)

"Un-googling" publications: the ethics and problems of anonymization
CHI '13 Extended Abstracts on Human Factors in Computing Systems

Abstract

This paper explores techniques for reducing the effectiveness of standard authorship attribution methods so that an author A can preserve anonymity for a particular document D. We discuss feature selection and adjustment and show how this information can be fed back to the author to create a new document D' for which the calculated attribution moves away from A. Since adjusting a document in this fashion can be labor-intensive, we attempt to quantify the effort required to produce the anonymized document and introduce two levels of anonymization: shallow and deep. On our test set, shallow anonymization requires 14 changes per 1000 words and reduces the likelihood of identifying A as the author by more than 83% on average. For deep anonymization, we adapt the unmasking work of Koppel and Schler to provide feedback that allows the author to choose the level of anonymization.
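
The feedback loop the abstract describes (measure features in D, tell the author what to adjust, count the changes) can be illustrated with a minimal sketch. This is not the authors' implementation: the feature list `FEATURES`, the helper names, and the idea of steering word rates toward a caller-supplied `target_rates` profile are all assumptions made for illustration, and the sketch ignores the small change in document length that the edits themselves cause.

```python
import re
from collections import Counter

# Hypothetical feature set: a few common function words (not the paper's list).
FEATURES = ["the", "of", "and", "to", "that", "which", "but"]


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def rates_per_1000(tokens, features=FEATURES):
    """Occurrences of each feature word per 1000 tokens."""
    counts = Counter(tokens)
    n = len(tokens)
    if n == 0:
        return {w: 0.0 for w in features}
    return {w: 1000.0 * counts[w] / n for w in features}


def suggest_changes(doc_text, target_rates, features=FEATURES):
    """Return {word: delta}: occurrences to add (+) or remove (-) so that
    the document's rate for each feature word matches target_rates.
    Assumption: document length is treated as fixed while editing."""
    tokens = tokenize(doc_text)
    counts = Counter(tokens)
    n = len(tokens)
    suggestions = {}
    for w in features:
        desired = round(target_rates[w] * n / 1000.0)
        delta = desired - counts[w]
        if delta != 0:
            suggestions[w] = delta
    return suggestions


def changes_per_1000(suggestions, n_tokens):
    """Editing effort expressed as total word changes per 1000 tokens."""
    return 1000.0 * sum(abs(d) for d in suggestions.values()) / n_tokens
```

In such a sketch, `target_rates` would come from text that is not by A (for example, an average over other authors), so that following the suggestions moves the measured features of D' away from A's profile. The effort measure mirrors the abstract's "changes per 1000 words" unit, though how the paper counts a change is not specified here.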