Identifying and Filtering Near-Duplicate Documents

  • Authors: Andrei Z. Broder

  • Affiliations: -

  • Venue: CPM '00 Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching

  • Year: 2000


Abstract

The mathematical concept of document resemblance captures well the informal notion of syntactic similarity. The resemblance can be estimated using a fixed-size "sketch" for each document. For a large collection of documents (say hundreds of millions) the size of this sketch is of the order of a few hundred bytes per document. However, for efficient large-scale web indexing it is not necessary to determine the actual resemblance value: it suffices to determine whether newly encountered documents are duplicates or near-duplicates of documents already indexed. In other words, it suffices to determine whether the resemblance is above a certain threshold. In this talk we show how this determination can be made using a "sample" of less than 50 bytes per document. The basic approach for computing resemblance has two aspects: first, resemblance is expressed as a set (of strings) intersection problem, and second, the relative size of intersections is evaluated by a process of random sampling that can be done independently for each document. The process of estimating the relative size of intersection of sets and the threshold test discussed above can be applied to arbitrary sets, and thus might be of independent interest. The algorithm for filtering near-duplicate documents discussed here has been successfully implemented and has been used for the last three years in the context of the AltaVista search engine.
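
The sketch below illustrates the two aspects mentioned in the abstract: documents are represented as sets of word shingles, resemblance is the relative size of the set intersection, and a fixed-size sketch of per-hash-function minima estimates it; a much smaller "sample" (groups of sketch components hashed together) supports the above-threshold test. The specific parameters (4-word shingles, 84 hash functions, 6 groups of 14, SHA-1-based hashing, a 2-group match rule) are illustrative assumptions for this example, not the parameters or hash family of the AltaVista implementation.

```python
# A minimal sketch of shingling, min-wise sampling, and the threshold test.
# All concrete parameters and the hash construction are assumptions made for
# illustration; they are not taken from the paper.

import hashlib
from typing import List, Set


def shingles(text: str, k: int = 4) -> Set[str]:
    """Represent a document as the set of its contiguous k-word shingles."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def resemblance(a: Set[str], b: Set[str]) -> float:
    """Exact resemblance: |intersection| / |union| of the shingle sets."""
    return len(a & b) / len(a | b) if (a | b) else 1.0


def _h(shingle: str, seed: int) -> int:
    """One member of a seed-indexed family of hash functions (assumed)."""
    return int.from_bytes(
        hashlib.sha1(f"{seed}:{shingle}".encode()).digest()[:8], "big")


def sketch(shingle_set: Set[str], num_hashes: int = 84) -> List[int]:
    """Fixed-size sketch: the minimum hash value under each hash function.
    For each function, the two minima agree with probability equal to the
    resemblance, so comparing sketches gives an unbiased estimate."""
    return [min(_h(s, seed) for s in shingle_set) for seed in range(num_hashes)]


def estimated_resemblance(sk_a: List[int], sk_b: List[int]) -> float:
    """Estimate resemblance as the fraction of agreeing sketch components."""
    return sum(x == y for x, y in zip(sk_a, sk_b)) / len(sk_a)


def sample(sk: List[int], group_size: int = 14) -> List[int]:
    """Much smaller 'sample' for the threshold test: hash the sketch in
    groups; two documents share a group value only if all group_size
    components agree, which concentrates matches near high resemblance."""
    return [
        int.from_bytes(
            hashlib.sha1(repr(sk[i:i + group_size]).encode()).digest()[:6],
            "big")
        for i in range(0, len(sk), group_size)
    ]


def looks_like_near_duplicate(sample_a: List[int], sample_b: List[int],
                              min_matches: int = 2) -> bool:
    """Threshold test: declare near-duplicates if enough groups agree."""
    return sum(x == y for x, y in zip(sample_a, sample_b)) >= min_matches


if __name__ == "__main__":
    base = ("we describe how to estimate the resemblance of documents using "
            "small fixed size sketches computed by min wise independent "
            "hashing of the set of word shingles contained in each document")
    d1, d2 = base, base + " online"   # nearly identical documents
    s1, s2 = shingles(d1), shingles(d2)
    print("exact resemblance:", round(resemblance(s1, s2), 3))
    sk1, sk2 = sketch(s1), sketch(s2)
    print("estimated resemblance:", round(estimated_resemblance(sk1, sk2), 3))
    print("near-duplicate?", looks_like_near_duplicate(sample(sk1), sample(sk2)))
```

With these assumed parameters each group value occupies a few bytes, so the per-document "sample" stays well under the 50 bytes mentioned in the abstract, while the full sketch (84 minima) is the larger per-document structure used for actual resemblance estimation.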