Copy detection mechanisms for digital documents

Authors:
Sergey Brin; James Davis; Héctor García-Molina
Affiliations:
Department of Computer Science, Stanford University, Stanford, CA; Department of Computer Science, Stanford University, Stanford, CA; Department of Computer Science, Stanford University, Stanford, CA
Venue:
SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data
Year:
1995

Abstract

In a digital library system, documents are available in digital form, so they are more easily copied and their copyrights are more easily violated. This is a serious problem, as it discourages owners of valuable information from sharing it with authorized users. There are two main philosophies for addressing this problem: prevention and detection. Prevention makes unauthorized use of documents difficult or impossible, while detection makes it easier to discover such activity.

In this paper we propose a system for registering documents and then detecting copies, whether complete or partial. We describe algorithms for such detection and the metrics required for evaluating detection mechanisms (covering accuracy, efficiency, and security). We also describe a working prototype, called COPS, discuss implementation issues, and present experimental results that suggest the proper settings for copy detection parameters.
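
The register-then-detect idea described in the abstract can be illustrated with a small sketch. This is not the COPS implementation, whose details are not given here. It assumes, for illustration only, that documents are split into sentence-like chunks, that each chunk is hashed with SHA-256, and that a document is flagged when the fraction of its chunks found in a registered document reaches a threshold. The names `CopyRegistry`, `sentence_chunks`, `chunk_hash`, and the default threshold of 0.5 are all hypothetical.

```python
import hashlib
import re


def sentence_chunks(text):
    """Split text into normalized sentence-like chunks.

    Assumption: punctuation-delimited sentences, lowercased, with
    non-alphanumeric characters dropped. Other chunking units are possible.
    """
    chunks = []
    for part in re.split(r"[.!?]+", text):
        words = re.findall(r"[a-z0-9]+", part.lower())
        if words:
            chunks.append(" ".join(words))
    return chunks


def chunk_hash(chunk):
    """Hash one chunk; SHA-256 is an illustrative choice."""
    return hashlib.sha256(chunk.encode("utf-8")).hexdigest()


class CopyRegistry:
    """Hypothetical registry mapping chunk hashes to registered documents."""

    def __init__(self):
        self._index = {}  # chunk hash -> set of registered document ids

    def register(self, doc_id, text):
        """Record every chunk of a document under its id."""
        for chunk in sentence_chunks(text):
            self._index.setdefault(chunk_hash(chunk), set()).add(doc_id)

    def check(self, text, threshold=0.5):
        """Return {doc_id: overlap} for registered documents whose shared
        chunks make up at least `threshold` of the checked text's
        distinct chunks."""
        hashes = {chunk_hash(c) for c in sentence_chunks(text)}
        if not hashes:
            return {}
        counts = {}
        for h in hashes:
            for doc_id in self._index.get(h, ()):
                counts[doc_id] = counts.get(doc_id, 0) + 1
        total = len(hashes)
        return {d: n / total for d, n in counts.items() if n / total >= threshold}


registry = CopyRegistry()
registry.register(
    "orig",
    "Digital documents are easy to copy. Detection finds copies after "
    "the fact. Prevention blocks them up front.",
)

# A complete copy shares every chunk with the registered document.
full_copy = registry.check(
    "Digital documents are easy to copy. Detection finds copies after "
    "the fact. Prevention blocks them up front."
)

# A partial copy shares one of its four chunks; a lower threshold catches it.
partial_copy = registry.check(
    "Digital documents are easy to copy. This sentence is new. "
    "So is this one. And this one.",
    threshold=0.25,
)
```

Lowering the threshold catches smaller partial copies at the cost of more false alarms, which is one way to picture the accuracy trade-off behind the parameter settings the abstract mentions.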