Finding near-duplicate web pages: a large-scale evaluation of algorithms

Authors:
Monika Henzinger
Affiliations:
Google Inc. & Ecole Fédérale de Lausanne (EPFL)
Venue:
SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Year:
2006

Citing 9
Cited 91

Copy detection mechanisms for digital documents

SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data
Building a scalable and accurate copy detection mechanism

Proceedings of the first ACM international conference on Digital libraries
Syntactic clustering of the Web

Selected papers from the sixth international conference on World Wide Web
Similarity estimation techniques from rounding algorithms

STOC '02 Proceedings of the thirty-fourth annual ACM symposium on Theory of computing
Finding Near-Replicas of Documents and Servers on the Web

WebDB '98 Selected papers from the International Workshop on The World Wide Web and Databases
Methods for identifying versioned and plagiarized documents

Journal of the American Society for Information Science and Technology
On the Evolution of Clusters of Near-Duplicate Web Pages

LA-WEB '03 Proceedings of the First Conference on Latin American Web Congress
Detecting phrase-level duplication on the world wide web

Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
MapReduce: simplified data processing on large clusters

OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6

Detecting near-duplicates for web crawling

Proceedings of the 16th international conference on World Wide Web
Multiple-signal duplicate detection for search evaluation

SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Principles of hash-based text retrieval

SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Tracking multiple topics for finding interesting articles

Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
Scalable near identical image and shot detection

Proceedings of the 6th ACM international conference on Image and video retrieval
Combinatorial algorithms for web search engines: three success stories

SODA '07 Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms
Tracking Web spam with HTML style similarities

ACM Transactions on the Web (TWEB)
Opinion spam and analysis

WSDM '08 Proceedings of the 2008 International Conference on Web Search and Data Mining
Efficient similarity joins for near duplicate detection

Proceedings of the 17th international conference on World Wide Web
iRobot: an intelligent crawler for web forums

Proceedings of the 17th international conference on World Wide Web
Exploring traversal strategy for web forum crawling

Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
SpotSigs: robust and efficient near duplicate detection in large web collections

Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Local text reuse detection

Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
De-duping URLs via rewrite rules

Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
Quantitative comparisons of search engine results

Journal of the American Society for Information Science and Technology
Trusting spam reporters: A reporter-based reputation system for email filtering

ACM Transactions on Information Systems (TOIS)
Ed-Join: an efficient algorithm for similarity joins with edit distance constraints

Proceedings of the VLDB Endowment
Achieving both high precision and high recall in near-duplicate detection

Proceedings of the 17th ACM conference on Information and knowledge management
Finding text reuse on the web

Proceedings of the Second ACM International Conference on Web Search and Data Mining
Detecting the origin of text segments efficiently

Proceedings of the 18th international conference on World wide web
IRLbot: Scaling to 6 billion pages and beyond

ACM Transactions on the Web (TWEB)
Leveraging discarded samples for tighter estimation of multiple-set aggregates

Proceedings of the eleventh international joint conference on Measurement and modeling of computer systems
Applying syntactic similarity algorithms for enterprise information management

Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
Near-duplicate detection for web-forums

IDEAS '09 Proceedings of the 2009 International Database Engineering & Applications Symposium
Automatic retrieval of similar content using search engine query interface

Proceedings of the 18th ACM conference on Information and knowledge management
URL normalization for de-duplication of web pages

Proceedings of the 18th ACM conference on Information and knowledge management
NEAR-Miner: mining evolution associations of web site directories for efficient maintenance of web archives

Proceedings of the VLDB Endowment
Exploiting Sentence-Level Features for Near-Duplicate Document Detection

AIRS '09 Proceedings of the 5th Asia Information Retrieval Symposium on Information Retrieval Technology
Learning URL patterns for webpage de-duplication

Proceedings of the third ACM international conference on Web search and data mining
Detecting visually similar Web pages: Application to phishing detection

ACM Transactions on Internet Technology (TOIT)
Fast approximate duplicate detection for 2D-NMR spectra

DILS'07 Proceedings of the 4th international conference on Data integration in the life sciences
A pattern tree-based approach to learning URL normalization rules

Proceedings of the 19th international conference on World wide web
Organizing news archives by near-duplicate copy detection in digital libraries

ICADL'07 Proceedings of the 10th international conference on Asian digital libraries: looking back 10 years and forging new frontiers
Detecting near-duplicates in large-scale short text databases

PAKDD'08 Proceedings of the 12th Pacific-Asia conference on Advances in knowledge discovery and data mining
Weighted shingling: an adaptation of shingling for weighted shingles

IIT'09 Proceedings of the 6th international conference on Innovations in information technology
Efficient parallel set-similarity joins using MapReduce

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Is this a good title?

Proceedings of the 21st ACM conference on Hypertext and hypermedia
Self-taught hashing for fast similarity search

Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
Adaptive near-duplicate detection via similarity learning

Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
Efficient partial-duplicate detection based on sequence matching

Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
Measuring the interestingness of articles in a limited user environment

Information Processing and Management: an International Journal
Simple and efficient algorithm for approximate dictionary matching

COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics
Large scale parallel document mining for machine translation

COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics
An efficient similarity join algorithm with cosine similarity predicate

DEXA'10 Proceedings of the 21st international conference on Database and expert systems applications: Part II
Hypergeometric language model and zipf-like scoring function for web document similarity retrieval

SPIRE'10 Proceedings of the 17th international conference on String processing and information retrieval
Learning website hierarchies for keyword enrichment in contextual advertising

Proceedings of the fourth ACM international conference on Web search and data mining
Fixing the threshold for effective detection of near duplicate web documents in web crawling

ADMA'10 Proceedings of the 6th international conference on Advanced data mining and applications: Part I
Detecting near-duplicate relations in user generated forum content

OTM'10 Proceedings of the 2010 international conference on On the move to meaningful internet systems
Exponential time improvement for min-wise based algorithms

Information and Computation
Intrinsic plagiarism analysis

Language Resources and Evaluation
PRESIDIO: A Framework for Efficient Archival Data Storage

ACM Transactions on Storage (TOS)
ATLAS: a probabilistic algorithm for high dimensional similarity search

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Large-scale copy detection

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Efficient similarity joins for near-duplicate detection

ACM Transactions on Database Systems (TODS)
Hypergeometric language models for republished article finding

Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
Query by document via a decomposition-based two-level retrieval approach

Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
No free lunch: brute force vs. locality-sensitive hashing for cross-lingual pairwise similarity

Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
SizeSpotSigs: an effective deduplicate algorithm considering the size of page content

PAKDD'11 Proceedings of the 15th Pacific-Asia conference on Advances in knowledge discovery and data mining - Volume Part I
Fast locality-sensitive hashing

Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
PG-join: proximity graph based string similarity joins

SSDBM'11 Proceedings of the 23rd international conference on Scientific and statistical database management
Efficient duplicate detection on cloud using a new signature scheme

WAIM'11 Proceedings of the 12th international conference on Web-age information management
A Text Similarity Meta-Search Engine Based on Document Fingerprints and Search Results Records

WI-IAT '11 Proceedings of the 2011 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology - Volume 01
Partial duplicate detection for large book collections

Proceedings of the 20th ACM international conference on Information and knowledge management
Probabilistic near-duplicate detection using simhash

Proceedings of the 20th ACM international conference on Information and knowledge management
Measuring redundancy level on the web

AINTEC '11 Proceedings of the 7th Asian Internet Engineering Conference
Temporal shingling for version identification in web archives

ECIR'2010 Proceedings of the 32nd European conference on Advances in Information Retrieval
Exponential time improvement for min-wise based algorithms

Proceedings of the twenty-second annual ACM-SIAM symposium on Discrete Algorithms
Bayesian locality sensitive hashing for fast similarity search

Proceedings of the VLDB Endowment
Keeping Found Things Found: The Study and Practice of Personal Information Management

Keeping Found Things Found: The Study and Practice of Personal Information Management
A fusion of algorithms in near duplicate document detection

PAKDD'11 Proceedings of the 15th international conference on New Frontiers in Applied Data Mining
FoCUS: learning to crawl web forums

Proceedings of the 21st international conference companion on World Wide Web
Clustering and load balancing optimization for redundant content removal

Proceedings of the 21st international conference companion on World Wide Web
V-SMART-join: a scalable mapreduce framework for all-pair similarity joins of multisets and vectors

Proceedings of the VLDB Endowment
Exploring temporal evidence in web information retrieval

FDIA'07 Proceedings of the 1st BCS IRSG conference on Future Directions in Information Access
CRSI: a compact randomized similarity index for set-valued features

Proceedings of the 15th International Conference on Extending Database Technology
Multi-resolution similarity hashing

Digital Investigation: The International Journal of Digital Forensics & Incident Response
Index maintenance for time-travel text search

SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
Detecting quilted web pages at scale

SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
Efficient range queries over uncertain strings

SSDBM'12 Proceedings of the 24th international conference on Scientific and Statistical Database Management
Learning to rank duplicate bug reports

Proceedings of the 21st ACM international conference on Information and knowledge management
Semi-supervised spectral hashing for fast similarity search

Neurocomputing
Rank hash similarity for fast similarity search

Information Processing and Management: an International Journal
Detecting near-duplicate documents using sentence-level features and supervised learning

Expert Systems with Applications: An International Journal
Reducing information redundancy in search results

Proceedings of the 28th Annual ACM Symposium on Applied Computing
Scalable all-pairs similarity search in metric spaces

Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining
Groundhog day: near-duplicate detection on Twitter

Proceedings of the 22nd international conference on World Wide Web
Bottom-k and priority sampling, set similarity and subset sums with minimal independence

Proceedings of the forty-fifth annual ACM symposium on Theory of computing
Near duplicate detection in an academic digital library

Proceedings of the 2013 ACM symposium on Document engineering
A pattern-based selective recrawling approach for object-level vertical search

Proceedings of the 22nd ACM international conference on information & knowledge management
Streaming similarity search over one billion tweets using parallel locality-sensitive hashing

Proceedings of the VLDB Endowment
Efficient estimation for high similarities using odd sketches

Proceedings of the 23rd international conference on World wide web

Abstract

Broder et al.'s [3] shingling algorithm and Charikar's [4] random projection based approach are considered "state-of-the-art" algorithms for finding near-duplicate web pages. Both algorithms were either developed at or used by popular web search engines. We compare the two algorithms on a very large scale, namely on a set of 1.6B distinct web pages. The results show that neither of the algorithms works well for finding near-duplicate pairs on the same site, while both achieve high precision for near-duplicate pairs on different sites. Since Charikar's algorithm finds more near-duplicate pairs on different sites, it achieves a better precision overall, namely 0.50 versus 0.38 for Broder et al.'s algorithm. We present a combined algorithm which achieves precision 0.79 with 79% of the recall of the other algorithms.
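
To make the two families of techniques named in the abstract concrete, the following is a minimal sketch, not the implementation evaluated in the paper. It contrasts a shingle-based set similarity (in the spirit of [3]) with a bit fingerprint built from per-token hash projections (in the spirit of [4]). The tokenizer, the shingle length `k=3`, the 64-bit fingerprint width, the use of SHA-1 as a stand-in hash, and all function names are assumptions chosen only for illustration.

```python
import hashlib
import re


def tokens(text):
    """Lowercase alphanumeric tokens (a simplistic, assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def shingles(text, k=3):
    """Set of all k-token windows in the text (k=3 is an illustrative choice)."""
    toks = tokens(text)
    return {" ".join(toks[i:i + k]) for i in range(max(len(toks) - k + 1, 1))}


def jaccard(a, b):
    """Fraction of shingles shared by two shingle sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def stable_hash(s, bits=64):
    """Deterministic integer hash of a string (SHA-1 used as a stand-in)."""
    digest = hashlib.sha1(s.encode("utf-8")).digest()
    return int.from_bytes(digest[:bits // 8], "big")


def simhash(text, bits=64):
    """Bit fingerprint: each token votes +1 or -1 per bit, and the sign is kept."""
    counts = [0] * bits
    for tok in tokens(text):
        h = stable_hash(tok, bits)
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if counts[i] > 0)


def bit_agreement(x, y, bits=64):
    """Fraction of bit positions on which two fingerprints agree."""
    return 1.0 - bin(x ^ y).count("1") / bits


if __name__ == "__main__":
    a = "the quick brown fox jumps over the lazy dog"
    b = "the quick brown fox jumps over the lazy cat"
    print("shingle similarity:", jaccard(shingles(a), shingles(b)))
    print("fingerprint agreement:", bit_agreement(simhash(a), simhash(b)))
```

In this sketch, two pages would be flagged as near-duplicates when either score exceeds a chosen threshold. Such a threshold is a tuning parameter and is not specified here.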