Detecting near-duplicates for web crawling

Authors:
Gurmeet Singh Manku;Arvind Jain;Anish Das Sarma
Affiliations:
Google Inc., Mountain View, CA;Google Inc., Mountain View, CA;Stanford University, Stanford, CA
Venue:
Proceedings of the 16th international conference on World Wide Web
Year:
2007

Abstract

Near-duplicate web documents are abundant. Two such documents may differ only in a very small portion, for example the part that displays advertisements. Such differences are irrelevant for web search, so the quality of a web crawler improves if it can assess whether a newly crawled web page is a near-duplicate of a previously crawled page. In the course of developing a near-duplicate detection system for a multi-billion page repository, we make two research contributions. First, we demonstrate that Charikar's fingerprinting technique is appropriate for this goal. Second, we present an algorithmic technique for identifying existing f-bit fingerprints that differ from a given fingerprint in at most k bit-positions, for small k. Our technique is useful for both online queries (single fingerprints) and batch queries (multiple fingerprints). Experimental evaluation over real data confirms the practicality of our design.
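
The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and the abstract does not specify any of its details. The sketch assumes whitespace tokens weighted by frequency, a SHA-256-derived token hash, and f = 64 with k = 3 chosen only for illustration. All function names (`token_hash`, `simhash`, `hamming_distance`, `near_duplicates`) are hypothetical. The lookup is a naive linear scan, swapped in for the more efficient search technique the paper contributes.

```python
import hashlib
from collections import Counter
from typing import List

F = 64  # fingerprint width in bits (illustrative assumption)


def token_hash(token: str) -> int:
    """Map a token to an F-bit integer (assumption: truncated SHA-256)."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return int.from_bytes(digest[: F // 8], "big")


def simhash(text: str) -> int:
    """Charikar-style fingerprint: per-bit weighted vote over token hashes."""
    weights = Counter(text.lower().split())  # assumption: frequency weights
    totals = [0] * F
    for token, weight in weights.items():
        h = token_hash(token)
        for bit in range(F):
            # A set bit votes +weight, a clear bit votes -weight.
            if (h >> bit) & 1:
                totals[bit] += weight
            else:
                totals[bit] -= weight
    fingerprint = 0
    for bit in range(F):
        if totals[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit-positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def near_duplicates(query: int, stored: List[int], k: int = 3) -> List[int]:
    """Naive linear scan: stored fingerprints within k bits of the query."""
    return [fp for fp in stored if hamming_distance(query, fp) <= k]
```

In this sketch, an online query corresponds to calling `near_duplicates` with the fingerprint of one newly crawled page, and a batch query corresponds to calling it once per fingerprint in a batch. A linear scan over a multi-billion page repository would be far too slow, which is the cost the paper's second contribution is meant to avoid.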