Methods for identifying versioned and plagiarized documents

Authors:
Timothy C. Hoad; Justin Zobel
Affiliations:
School of Computer Science and Information Technology, RMIT University, GPO Box 2476V, Melbourne 3001, Australia (both authors)
Venue:
Journal of the American Society for Information Science and Technology
Year:
2003

Abstract

The widespread use of online publishing of text promotes storage of multiple versions of documents and mirroring of documents in multiple locations, and greatly simplifies the task of plagiarizing the work of others. We evaluate two families of methods for searching a collection to find documents that are co-derivative, that is, are versions or plagiarisms of each other. The first, the ranking family, uses information retrieval techniques; extending this family, we propose the identity measure, which is specifically designed for identification of co-derivative documents. The second, the fingerprinting family, uses hashing to generate a compact description of a document, its fingerprint, which can then be compared to the fingerprints of the documents in the collection. We introduce a new method for evaluating the effectiveness of these techniques, and demonstrate it in practice. Using experiments on two collections, we demonstrate that the identity measure and the best fingerprinting technique are both able to accurately identify co-derivative documents. However, for fingerprinting, the parameters must be carefully chosen, and even so the identity measure is clearly superior.
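
As a rough illustration of the fingerprinting family described in the abstract, the sketch below hashes overlapping word sequences of a document into a set of integers and compares two such sets by their overlap. It is not the authors' implementation: the function names, the chunk length `n`, the `keep_every` selection rule, the use of MD5, and the overlap score are all assumptions chosen for the example, and the identity measure from the ranking family is not shown.

```python
import hashlib


def fingerprint(text, n=3, keep_every=1):
    """Hash every run of n consecutive words; keep hashes divisible by keep_every.

    n and keep_every are hypothetical parameters for this sketch only.
    With keep_every=1 every chunk hash is kept.
    """
    words = text.lower().split()
    hashes = set()
    for i in range(len(words) - n + 1):
        chunk = " ".join(words[i:i + n])
        digest = hashlib.md5(chunk.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # 64-bit prefix of the digest
        if h % keep_every == 0:
            hashes.add(h)
    return hashes


def overlap(fp_a, fp_b):
    """Shared hashes divided by all hashes (an assumed score, not the paper's)."""
    if not fp_a or not fp_b:
        return 0.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)


def rank(query, collection):
    """Score each document in a {name: text} collection against the query,
    highest overlap first."""
    q = fingerprint(query)
    scored = [(overlap(q, fingerprint(doc)), name) for name, doc in collection.items()]
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    docs = {
        "v2": "the quick brown fox jumps over the lazy dog today",
        "other": "completely unrelated text about something else entirely",
    }
    for score, name in rank("the quick brown fox jumps over the lazy dog", docs):
        print(name, round(score, 3))
```

Raising `keep_every` makes the fingerprint more compact but discards chunks, which is one simple example of the kind of parameter choice the abstract says must be made carefully.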