Detecting text reuse with modified and weighted n-grams

Authors:
Rao Muhammad Adeel Nawab;Mark Stevenson;Paul Clough
Affiliations:
University of Sheffield, UK;University of Sheffield, UK;University of Sheffield, UK
Venue:
SemEval '12 Proceedings of the First Joint Conference on Lexical and Computational Semantics - Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation
Year:
2012

Abstract

Text reuse is common in many scenarios, and documents are often based, at least in part, on existing documents. This paper reports an approach to detecting text reuse that identifies not only documents reused verbatim but also cases in which the original has been rewritten. The approach identifies reuse by comparing word n-grams across documents, and it modifies these n-grams (by substituting words with synonyms and by deleting words) so that altered text can still be matched. Applied to a corpus of newspaper stories, the approach outperforms a previously reported method.
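
The idea of matching modified n-grams can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the synonym resource, the n-gram length, the weighting scheme, or which document's n-grams are modified. The sketch below assumes a hypothetical toy synonym table (`SYNONYMS`), unweighted trigrams, a simple containment score, and expansion of the source document's n-grams only.

```python
# Hypothetical toy synonym table (an assumption for illustration only;
# the abstract does not name the lexical resource used in the paper).
SYNONYMS = {
    "big": {"large"},
    "large": {"big"},
    "car": {"automobile"},
    "automobile": {"car"},
}


def ngrams(tokens, n):
    """Return the word n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def substitution_variants(gram):
    """Variants of one n-gram with a single word replaced by a synonym."""
    variants = set()
    for i, word in enumerate(gram):
        for synonym in SYNONYMS.get(word, ()):
            variants.add(gram[:i] + (synonym,) + gram[i + 1:])
    return variants


def deletion_variants(tokens, n):
    """n-grams obtained by dropping one interior word from each (n+1)-word window."""
    variants = set()
    for window in ngrams(tokens, n + 1):
        for skip in range(1, n):  # interior positions only
            variants.add(window[:skip] + window[skip + 1:])
    return variants


def containment(source_tokens, derived_tokens, n=3, modified=True):
    """Fraction of distinct n-grams in the derived text that match the source.

    With modified=True, the source n-gram set is expanded with synonym
    substitutions and single-word deletions before matching.
    """
    derived = set(ngrams(derived_tokens, n))
    if not derived:
        return 0.0
    source = set(ngrams(source_tokens, n))
    if modified:
        expanded = set(source)
        for gram in source:
            expanded |= substitution_variants(gram)
        expanded |= deletion_variants(source_tokens, n)
        source = expanded
    return len(derived & source) / len(derived)


source = "the big car drove quickly down the road".split()
derived = "the large car drove down the road".split()

plain_score = containment(source, derived, modified=False)
modified_score = containment(source, derived, modified=True)
print(plain_score, modified_score)
```

In this toy example the derived sentence replaces "big" with "large" and drops "quickly", so exact trigram matching finds only one of its five trigrams in the source, while the expanded set covers all five. A real system would also need tokenisation, case normalisation, and a proper synonym source, none of which are shown here.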