The hiding virtues of ambiguity: quantifiably resilient watermarking of natural language text through synonym substitutions

Authors:
Umut Topkara;Mercan Topkara;Mikhail J. Atallah
Affiliations:
Purdue University;Purdue University;Purdue University
Venue:
MM&Sec '06 Proceedings of the 8th workshop on Multimedia and security
Year:
2006

Abstract

Information hiding in natural language text has mainly consisted of carrying out approximately meaning-preserving modifications on the given cover text until it encodes the intended mark. A major technique for doing so has been synonym substitution. In these previous schemes, synonym substitutions were done until the text "confessed", i.e., carried the intended mark message. We propose here a better way to use synonym substitution, one that is no longer entirely guided by the mark-insertion process: it is also guided by a resilience requirement, subject to a maximum allowed distortion constraint. Previous schemes for information hiding in natural language text did not use numeric quantification of the distortions introduced by transformations; they mainly used heuristic measures of quality based on conformity to a language model (and not in reference to the original cover text). When there are many alternatives to carry out a substitution on a word, we prioritize these alternatives according to a quantitative resilience criterion and use them in that order. In a nutshell, we favor the more ambiguous alternatives. In fact, not only do we attempt to achieve the maximum ambiguity, but we also want to be as close as possible to the above-mentioned distortion limit, as that prevents the adversary from doing further transformations without exceeding the damage threshold. That is, we continue to modify the document even after the text has "confessed" to the mark, for the dual purpose of maximizing ambiguity while deliberately getting as close as possible to the distortion limit. The quantification we use makes it possible to apply the existing information-theoretic framework to the natural language domain, which has unique challenges not present in the image or audio domains.
The resilience stems from both (i) the fact that the adversary does not know where the changes were made, and (ii) the fact that automated disambiguation is a major difficulty faced by any natural language processing system (what is bad news for the natural language processing area is good news for our scheme's resilience). In addition to the above-mentioned design and analysis, another contribution of this paper is the description of the implementation of the scheme and of the experimental data obtained.
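
The prioritization idea in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the `Candidate` record, the integer `ambiguity` and `distortion` scores, and the function `embed_greedy` are all hypothetical names and values chosen for illustration, and the mark-encoding step is omitted entirely. The sketch only shows candidate substitutions being tried most-ambiguous-first while the cumulative distortion stays within an assumed budget.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A hypothetical synonym substitution (all scores are assumed values)."""
    position: int      # index of the word in the cover text
    replacement: str   # candidate synonym
    ambiguity: int     # assumed score, e.g. number of senses of the synonym
    distortion: int    # assumed cost of the substitution, in arbitrary units


def embed_greedy(words, candidates, max_distortion):
    """Apply substitutions most-ambiguous-first within a distortion budget.

    Toy illustration only: it does not encode any mark bits.
    """
    out = list(words)
    used_positions = set()
    spent = 0
    # Most ambiguous first; ties broken by lower distortion.
    ordered = sorted(candidates, key=lambda c: (-c.ambiguity, c.distortion))
    for cand in ordered:
        if cand.position in used_positions:
            continue  # this word has already been substituted
        if spent + cand.distortion > max_distortion:
            continue  # would exceed the allowed distortion
        out[cand.position] = cand.replacement
        used_positions.add(cand.position)
        spent += cand.distortion
    return out, spent


words = ["the", "large", "bank", "closed"]
candidates = [
    Candidate(1, "great", ambiguity=6, distortion=4),
    Candidate(1, "big", ambiguity=3, distortion=2),
    Candidate(3, "shut", ambiguity=4, distortion=3),
    Candidate(2, "depository", ambiguity=1, distortion=5),
]
marked, spent = embed_greedy(words, candidates, max_distortion=8)
print(marked, spent)
```

In this made-up example the two most ambiguous substitutions are applied and use 7 of the 8 available distortion units, so the remaining candidate no longer fits. That mirrors, in a very simplified way, the stated goal of ending close to the distortion limit so that an adversary has little room left for further changes.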