The advent of the Internet has made the illegal dissemination of copyrighted material easy. An important problem is how to automatically detect when a "new" digital document is "suspiciously close" to existing ones. The SCAM project at Stanford University has addressed this problem for the case of a single registered-document database. In practice, however, text documents may appear in many autonomous databases, and one would like to discover copies without searching all databases exhaustively. Our approach, dSCAM, is a distributed version of SCAM that keeps succinct meta-information about the contents of the available document databases. Given a suspicious document S, dSCAM uses this information to prune every database that cannot contain any document close enough to S, so the search can focus on the remaining sites. We also study how to query the remaining databases so as to minimize various querying costs. We empirically evaluate the pruning and searching schemes using a collection of 50 databases and two sets of test documents.
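The pruning idea above can be illustrated with a small sketch. This is not the authors' implementation: it assumes, purely for illustration, that each database's meta-information records, for every term, the maximum frequency with which that term occurs in any single document of that database. From this alone one can compute an upper bound on how much of a suspicious document S any one stored document could match, and discard databases whose bound falls below the detection threshold without contacting them.

```python
# Hypothetical sketch of dSCAM-style database pruning (assumed metadata
# format, not the paper's actual scheme). Each database is summarized by a
# dict mapping term -> maximum frequency of that term in any one of its
# documents. A database can be pruned when even the best-case document
# could not overlap S above the threshold.

from collections import Counter


def prune_databases(suspicious_text, db_metadata, threshold):
    """Return the names of databases that might hold a close copy of S.

    db_metadata: {db_name: {term: max frequency in any single document}}
    threshold:   minimum fraction of S's term occurrences that a candidate
                 document must be able to match (a simplified closeness test).
    """
    s_counts = Counter(suspicious_text.lower().split())
    total = sum(s_counts.values())
    survivors = []
    for name, max_counts in db_metadata.items():
        # Each term of S can be matched at most min(its count in S,
        # its per-document maximum in the database) times.
        matched = sum(min(c, max_counts.get(t, 0)) for t, c in s_counts.items())
        if total and matched / total >= threshold:
            survivors.append(name)
    return survivors


if __name__ == "__main__":
    metadata = {
        "db1": {"the": 5, "cat": 2, "sat": 1},  # could fully match S
        "db2": {"dog": 3},                      # shares no terms with S
    }
    print(prune_databases("the cat sat", metadata, threshold=0.5))
```

The upper bound is safe in one direction only: a surviving database may still contain no close copy (the bound is optimistic), but a pruned database provably cannot, which is what makes it sound to skip those sites entirely.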