Building a scalable and accurate copy detection mechanism

Authors:
Narayanan Shivakumar; Hector Garcia-Molina
Affiliations:
Department of Computer Science, Stanford, CA; Department of Computer Science, Stanford, CA
Venue:
Proceedings of the first ACM international conference on Digital libraries
Year:
1996

Abstract

Often, publishers are reluctant to offer valuable digital documents on the Internet for fear that they will be re-transmitted or copied widely. A Copy Detection Mechanism can help identify such copying. For example, publishers may register their documents with a copy detection server, and the server can then automatically check public sources such as UseNet articles and Web sites for potential illegal copies. The server can search for exact copies, and also for cases where significant portions of documents have been copied. In this paper we study, for the first time, the performance of various copy detection mechanisms, including the disk storage requirements, main memory requirements, response times for registration, and response time for querying. We also contrast performance to the accuracy of the mechanisms (how well they detect partial copies). The results are obtained using SCAM, an experimental server we have implemented, and a collection of 50,000 netnews articles.
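
The register-then-query workflow described in the abstract can be illustrated with a small sketch. This is not SCAM's actual algorithm, which the abstract does not specify. It is a toy in-memory registry that assumes documents are compared by overlapping word chunks. The names `chunks`, `ToyCopyDetector`, `register`, `query`, and the parameters `k` and `threshold` are all hypothetical.

```python
import re
from collections import defaultdict


def chunks(text, k=3):
    """Return the set of overlapping k-word chunks in text (lowercased)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


class ToyCopyDetector:
    """Hypothetical in-memory registry; an assumption, not SCAM's design."""

    def __init__(self, k=3):
        self.k = k
        self.index = defaultdict(set)  # chunk -> ids of registered documents
        self.sizes = {}                # doc id -> number of distinct chunks

    def register(self, doc_id, text):
        """Index every chunk of a registered document."""
        doc_chunks = chunks(text, self.k)
        self.sizes[doc_id] = len(doc_chunks)
        for c in doc_chunks:
            self.index[c].add(doc_id)

    def query(self, text, threshold=0.5):
        """Return {doc_id: fraction of that document's chunks found in text},
        keeping only documents whose fraction reaches the threshold."""
        hits = defaultdict(int)
        for c in chunks(text, self.k):
            for doc_id in self.index.get(c, ()):
                hits[doc_id] += 1
        return {d: n / self.sizes[d] for d, n in hits.items()
                if n / self.sizes[d] >= threshold}


detector = ToyCopyDetector(k=3)
detector.register("a", "the quick brown fox jumps over the lazy dog")
full_copy = detector.query("he said the quick brown fox jumps over the lazy dog today")
partial_copy = detector.query("the quick brown fox jumps high", threshold=0.4)
unrelated = detector.query("completely unrelated text here")
```

Even in this toy form the trade-off the abstract mentions is visible: smaller chunks and a lower threshold catch more partial copies, but they enlarge the index and raise the chance of false matches.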