On the evolution of clusters of near-duplicate web pages

Authors:
Dennis Fetterly;Mark Manasse;Marc Najork
Affiliations:
Microsoft Research, Mountain View, CA;Microsoft Research, Mountain View, CA;Microsoft Research, Mountain View, CA
Venue:
Journal of Web Engineering
Year:
2003

Citing 10
Cited 2

Syntactic clustering of the Web

Selected papers from the sixth international conference on World Wide Web
Finding related pages in the World Wide Web

WWW '99 Proceedings of the eighth international conference on World Wide Web
Mirror, mirror on the Web: a study of host pairs with replicated content

WWW '99 Proceedings of the eighth international conference on World Wide Web
Finding replicated Web collections

SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Min-wise independent permutations

Journal of Computer and System Sciences - 30th annual ACM symposium on theory of computing
A comparison of techniques to find mirrored hosts on the WWW

Journal of the American Society for Information Science
Space/time trade-offs in hash coding with allowable errors

Communications of the ACM
A large-scale study of the evolution of web pages

WWW '03 Proceedings of the 12th international conference on World Wide Web
On the Evolution of Clusters of Near-Duplicate Web Pages

LA-WEB '03 Proceedings of the First Conference on Latin American Web Congress
Finding similar files in a large file system

WTEC'94 Proceedings of the USENIX Winter 1994 Technical Conference

Applying syntactic similarity algorithms for enterprise information management

Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
Survey on web spam detection: principles and algorithms

ACM SIGKDD Explorations Newsletter

Abstract

This paper expands on a 1997 study of the amount and distribution of near-duplicate pages on the World Wide Web. We downloaded a set of 150 million web pages on a weekly basis over the span of 11 weeks. We then determined which of these pages are near-duplicates of one another, and tracked how clusters of near-duplicate documents evolved over time. We found that 29.2% of all web pages are very similar to other pages, and that 22.2% are virtually identical to other pages. We also found that clusters of near-duplicate documents are fairly stable: Two documents that are near-duplicates of one another are very likely to still be near-duplicates 10 weeks later. This result is of significant relevance to search engines: Web crawlers can be fairly confident that two pages that have been found to be near-duplicates of one another will continue to be so for the foreseeable future, and may thus decide to recrawl only one version of that page, or at least to lower the download priority of the other versions, thereby freeing up crawling resources that can be brought to bear more productively somewhere else. Additionally, we revisit issues raised in a 1999 study of the prevalence of mirrored content, that is, trees of web content accessible at multiple locations. We found that 4.9% of all web pages are mirrors.
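
The abstract does not say how near-duplicates were identified or how stability was measured. As a purely illustrative sketch, assuming similarity is measured as the Jaccard resemblance of word shingles (a common approach, not confirmed here as the authors' method), the following Python checks which pairs of pages are near-duplicates in both an early and a late crawl. The function names, the shingle width `w=4`, the `0.9` threshold, and the URL-to-text dictionaries are all hypothetical choices made for this example.

```python
def shingles(text, w=4):
    """Return the set of contiguous w-word sequences (shingles) in text."""
    words = text.lower().split()
    if len(words) < w:
        # Short documents: treat the whole text as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a, b, w=4):
    """Jaccard similarity of the two documents' shingle sets."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0  # two empty documents are treated as identical
    return len(sa & sb) / len(sa | sb)


def stable_pairs(week_first, week_last, threshold=0.9, w=4):
    """Return URL pairs that are near-duplicates in both crawls.

    week_first and week_last map URL -> page text (assumed inputs).
    The threshold is an illustrative value, not one taken from the paper.
    """
    urls = sorted(set(week_first) & set(week_last))
    stable = []
    for i, u in enumerate(urls):
        for v in urls[i + 1:]:
            if (resemblance(week_first[u], week_first[v], w) >= threshold
                    and resemblance(week_last[u], week_last[v], w) >= threshold):
                stable.append((u, v))
    return stable
```

This all-pairs comparison is quadratic in the number of pages, so it is only suitable for small samples. At the scale of 150 million pages, some form of sketching or hashing would be needed to avoid comparing every pair.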