Accurate discovery of co-derivative documents via duplicate text detection

Authors:
Yaniv Bernstein; Justin Zobel
Affiliations:
School of Computer Science and Information Technology, RMIT University, Melbourne, Australia (both authors)
Venue:
Information Systems
Year:
2006

Abstract

Documents are co-derivative if they share content: for two documents to be co-derived, some portion of one must be derived from the other, or some portion of both must be derived from a third document. An existing technique for concurrently detecting all co-derivatives in a collection is document fingerprinting, which matches documents based on the hash values of selected document subsequences, or chunks. Fingerprinting is hampered by an inability to accurately isolate information that is useful in identifying co-derivatives. In this paper we present SPEX, a novel hash-based algorithm for extracting duplicated chunks from a document collection. We discuss how information about shared chunks can be used for efficiently and reliably identifying co-derivative clusters, and describe DECO, a prototype package that combines the SPEX algorithm with other optimisations and compressed indexing to produce a flexible and scalable co-derivative discovery system. Our experiments with multi-gigabyte document collections demonstrate the effectiveness of the approach.
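As a rough illustration of the general idea of matching documents by hashed chunks, the following minimal sketch counts the chunks that each pair of documents has in common. It is not the SPEX algorithm or the DECO system, whose details are not given in the abstract. The function names, the choice of word 3-grams as chunks, and the use of a truncated MD5 digest as the hash are all assumptions made for this example.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def chunks(text, n=3):
    """Split text into overlapping word n-grams (hypothetical chunk definition)."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def chunk_hash(chunk):
    """Hash a chunk to a short hex string (truncated MD5, chosen only for illustration)."""
    return hashlib.md5(chunk.encode("utf-8")).hexdigest()[:16]


def shared_chunk_counts(docs, n=3):
    """Return {(doc_a, doc_b): number of distinct chunk hashes both documents contain}."""
    # Map each chunk hash to the set of documents in which it occurs.
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for c in chunks(text, n):
            postings[chunk_hash(c)].add(doc_id)

    # Only hashes seen in more than one document say anything about co-derivation.
    counts = defaultdict(int)
    for ids in postings.values():
        if len(ids) > 1:
            for a, b in combinations(sorted(ids), 2):
                counts[(a, b)] += 1
    return dict(counts)


docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "a quick brown fox jumps high",
    "c": "completely unrelated text here now",
}
print(shared_chunk_counts(docs))  # {('a', 'b'): 2}
```

Documents "a" and "b" share the chunks "quick brown fox" and "brown fox jumps", so the pair is reported with a count of 2, while "c" shares nothing and does not appear. A practical system would need to avoid storing every chunk of a large collection, which is the kind of cost the abstract says SPEX and DECO are designed to address.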