Strategies for retrieving plagiarized documents

Authors:
Benno Stein; Sven Meyer zu Eissen; Martin Potthast
Affiliations:
Bauhaus University Weimar, Weimar, Germany; Bauhaus University Weimar, Weimar, Germany; Bauhaus University Weimar, Weimar, Germany
Venue:
SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Year:
2007

Abstract

For the identification of plagiarized passages in large document collections, we present retrieval strategies that rely on stochastic sampling and chunk indexes. Using the entire Wikipedia corpus, we compile n-gram indexes and compare them to a new kind of fingerprint index in a plagiarism analysis use case. Our index speeds up the analysis by a factor of 1.5 and is an order of magnitude smaller, while being equivalent in terms of precision and recall.