Detecting near-duplicates for web crawling

  • Authors:
  • Gurmeet Singh Manku;Arvind Jain;Anish Das Sarma

  • Affiliations:
  • Google Inc., Mountain View, CA;Google Inc., Mountain View, CA;Stanford University, Stanford, CA

  • Venue:
  • Proceedings of the 16th international conference on World Wide Web
  • Year:
  • 2007

Quantified Score

Hi-index 0.02

Visualization

Abstract

Near-duplicate web documents are abundant. Two such documents differ from each other in a very small portion that displays advertisements, for example. Such differences are irrelevant for web search. So the quality of a web crawler increases if it can assess whether a newly crawled web page is a near-duplicate of a previously crawled web page or not. In the course of developing a near-duplicate detection system for a multi-billion page repository, we make two research contributions. First, we demonstrate that Charikar's fingerprinting technique is appropriate for this goal. Second, we present an algorithmic technique for identifying existing f-bit fingerprints that differ from a given fingerprint in at most k bit-positions, for small k. Our technique is useful for both online queries (single fingerprints) and all batch queries (multiple fingerprints). Experimental evaluation over real data confirms the practicality of our design.