On the Evolution of Clusters of Near-Duplicate Web Pages

Authors:
Dennis Fetterly;Mark Manasse;Marc Najork
Venue:
LA-WEB '03 Proceedings of the First Conference on Latin American Web Congress
Year:
2003

Citing 0
Cited 44

Improved robustness of signature-based near-replica detection via lexicon randomization

Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
Spam, damn spam, and statistics: using statistical analysis to locate spam web pages

Proceedings of the 7th International Workshop on the Web and Databases: colocated with ACM SIGMOD/PODS 2004
Detecting phrase-level duplication on the world wide web

Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
Redundant documents and search effectiveness

Proceedings of the 14th ACM international conference on Information and knowledge management
Managing déjà vu: Collection building for the identification of nonidentical duplicate documents

Journal of the American Society for Information Science and Technology - Research Articles
Undue influence: eliminating the impact of link plagiarism on web search rankings

Proceedings of the 2006 ACM symposium on Applied computing
Finding near-duplicate web pages: a large-scale evaluation of algorithms

SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
The discoverability of the web

Proceedings of the 16th international conference on World Wide Web
Improving web spam classifiers using link structure

AIRWeb '07 Proceedings of the 3rd international workshop on Adversarial information retrieval on the web
TAPER: tiered approach for eliminating redundancy in replica synchronization

FAST'05 Proceedings of the 4th conference on USENIX Conference on File and Storage Technologies - Volume 4
Distributed text retrieval from overlapping collections

ADC '07 Proceedings of the eighteenth conference on Australasian database - Volume 63
Essential deduplication functions for transactional databases in law firms

Proceedings of the 11th international conference on Artificial intelligence and law
Combinatorial algorithms for web search engines: three success stories

SODA '07 Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms
Spamscatter: characterizing internet scam hosting infrastructure

SS'07 Proceedings of 16th USENIX Security Symposium on USENIX Security Symposium
Efficient similarity joins for near duplicate detection

Proceedings of the 17th international conference on World Wide Web
Improving web information indexing and retrieval based on center block duplication detection

International Journal of Innovative Computing and Applications
Lexicon randomization for near-duplicate detection with I-Match

The Journal of Supercomputing
De-duping URLs via rewrite rules

Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
Achieving both high precision and high recall in near-duplicate detection

Proceedings of the 17th ACM conference on Information and knowledge management
Annotate once, appear anywhere: collective foraging for snippets of interest using paragraph fingerprinting

Proceedings of the SIGCHI Conference on Human Factors in Computing Systems
Detecting the origin of text segments efficiently

Proceedings of the 18th international conference on World wide web
IRLbot: Scaling to 6 billion pages and beyond

ACM Transactions on the Web (TWEB)
URL normalization for de-duplication of web pages

Proceedings of the 18th ACM conference on Information and knowledge management
Learning URL patterns for webpage de-duplication

Proceedings of the third ACM international conference on Web search and data mining
Understanding content reuse on the web: static and dynamic analyses

WebKDD'06 Proceedings of the 8th Knowledge discovery on the web international conference on Advances in web mining and web usage analysis
Weighted shingling: an adaptation of shingling for weighted shingles

IIT'09 Proceedings of the 6th international conference on Innovations in information technology
Is this a good title?

Proceedings of the 21st ACM conference on Hypertext and hypermedia
Adaptive near-duplicate detection via similarity learning

Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
Comparing the sensitivity of information retrieval metrics

Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
A hierarchical adaptive probabilistic approach for zero hour phish detection

ESORICS'10 Proceedings of the 15th European conference on Research in computer security
Federated Search

Foundations and Trends in Information Retrieval
Efficient similarity joins for near-duplicate detection

ACM Transactions on Database Systems (TODS)
ViDeDup: an application-aware framework for video de-duplication

HotStorage'11 Proceedings of the 3rd USENIX conference on Hot topics in storage and file systems
On the evolution of clusters of near-duplicate web pages

Journal of Web Engineering
CANTINA+: A Feature-Rich Machine Learning Framework for Detecting Phishing Web Sites

ACM Transactions on Information and System Security (TISSEC)
Detection of near-duplicate user generated contents: the SMS spam collection

Proceedings of the 3rd international workshop on Search and mining user-generated contents
A systematic study of parameter correlations in large scale duplicate document detection

PAKDD'06 Proceedings of the 10th Pacific-Asia conference on Advances in Knowledge Discovery and Data Mining
Compact features for detection of near-duplicates in distributed retrieval

SPIRE'06 Proceedings of the 13th international conference on String Processing and Information Retrieval
The case of the duplicate documents measurement, search, and science

APWeb'06 Proceedings of the 8th Asia-Pacific Web conference on Frontiers of WWW Research and Development
Thwarting the nigritude ultramarine: learning to identify link spam

ECML'05 Proceedings of the 16th European conference on Machine Learning
Fast discovery of similar sequences in large genomic collections

ECIR'06 Proceedings of the 28th European conference on Advances in Information Retrieval
A fusion of algorithms in near duplicate document detection

PAKDD'11 Proceedings of the 15th international conference on New Frontiers in Applied Data Mining
Detecting quilted web pages at scale

SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
Detecting near-duplicate documents using sentence-level features and supervised learning

Expert Systems with Applications: An International Journal

Abstract

This paper expands on a 1997 study of the amount and distribution of near-duplicate pages on the World Wide Web. We downloaded a set of 150 million web pages on a weekly basis over the span of 11 weeks. We then determined which of these pages are near-duplicates of one another, and tracked how clusters of near-duplicate documents evolved over time. We found that 29.2% of all web pages are very similar to other pages, and that 22.2% are virtually identical to other pages. We also found that clusters of near-duplicate documents are fairly stable: Two documents that are near-duplicates of one another are very likely to still be near-duplicates 10 weeks later. This result is of significant relevance to search engines: Web crawlers can be fairly confident that two pages that have been found to be near-duplicates of one another will continue to be so for the foreseeable future, and may thus decide to recrawl only one version of that page, or at least to lower the download priority of the other versions, thereby freeing up crawling resources that can be brought to bear more productively somewhere else.
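
The stability measurement described in the abstract can be illustrated with a small sketch. The abstract does not state how near-duplicates were detected, so everything below is an assumption made for illustration only: pages are compared by the Jaccard similarity of their word-level shingles, and the names `shingles`, `jaccard`, `near_duplicate_pairs`, and `surviving_fraction`, the shingle width `k=4`, and the threshold `0.5` are all hypothetical. The sketch reports what fraction of the near-duplicate pairs found in one weekly snapshot are still near-duplicates in a later snapshot.

```python
from itertools import combinations


def shingles(text, k=4):
    """Return the set of k-word shingles (contiguous word windows) of text."""
    words = text.lower().split()
    if len(words) < k:
        # Short pages yield a single shingle (or none if the page is empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicate_pairs(pages, threshold=0.5, k=4):
    """pages maps URL -> text. Return the set of URL pairs (as frozensets)
    whose shingle sets have Jaccard similarity >= threshold (assumed value)."""
    sets = {url: shingles(text, k) for url, text in pages.items()}
    return {
        frozenset((u, v))
        for u, v in combinations(sorted(sets), 2)
        if jaccard(sets[u], sets[v]) >= threshold
    }


def surviving_fraction(first_week, later_week, threshold=0.5, k=4):
    """Fraction of near-duplicate pairs in first_week that are still
    near-duplicates in later_week (0.0 if first_week has no such pairs)."""
    before = near_duplicate_pairs(first_week, threshold, k)
    if not before:
        return 0.0
    after = near_duplicate_pairs(later_week, threshold, k)
    return len(before & after) / len(before)


# Toy snapshots standing in for two weekly crawls of the same URLs.
week1 = {
    "a": "the quick brown fox jumps over the lazy dog today",
    "b": "the quick brown fox jumps over the lazy dog tonight",
    "c": "completely different text about web crawlers and search engines",
}
week2 = dict(week1, b="an entirely new page with unrelated content about cooking recipes")

print(surviving_fraction(week1, week1))
print(surviving_fraction(week1, week2))
```

The all-pairs comparison here is quadratic in the number of pages and is meant only to make the measurement concrete; it would not scale to a crawl of 150 million pages, where some form of fingerprinting or sketching would be needed.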