On the evolution of clusters of near-duplicate web pages
Journal of Web Engineering
This paper expands on a 1997 study of the amount and distribution of near-duplicate pages on the World Wide Web. We downloaded a set of 150 million web pages on a weekly basis over the span of 11 weeks. We then determined which of these pages are near-duplicates of one another, and tracked how clusters of near-duplicate documents evolved over time. We found that 29.2% of all web pages are very similar to other pages, and that 22.2% are virtually identical to other pages. We also found that clusters of near-duplicate documents are fairly stable: two documents that are near-duplicates of one another are very likely to still be near-duplicates 10 weeks later. This result is of significant relevance to search engines: web crawlers can be fairly confident that two pages that have been found to be near-duplicates of one another will continue to be so for the foreseeable future, and may thus decide to recrawl only one version of that page, or at least to lower the download priority of the other versions, thereby freeing up crawling resources that can be brought to bear more productively somewhere else.
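
The stability measurement described above can be illustrated with a minimal sketch. The abstract does not say which similarity measure or threshold was used, so everything here is an assumption made for illustration: pages are compared by Jaccard similarity over word shingles, the threshold of 0.5 and shingle length of 3 are arbitrary, and the function names are hypothetical. The all-pairs comparison is quadratic and would not scale to a crawl of 150 million pages; it only shows what "still near-duplicates in a later crawl" could mean operationally.

```python
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles (as tuples) of a text."""
    words = text.lower().split()
    if len(words) < k:
        # Short texts become a single shingle; empty texts have none.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicate_pairs(pages, threshold=0.5, k=3):
    """Map {url: text} to the set of URL pairs at or above the threshold.

    The threshold is an assumed value, not one taken from the paper.
    """
    sets = {url: shingles(text, k) for url, text in pages.items()}
    return {
        frozenset((u, v))
        for u, v in combinations(sorted(sets), 2)
        if jaccard(sets[u], sets[v]) >= threshold
    }


def pair_stability(first_crawl, later_crawl, threshold=0.5, k=3):
    """Fraction of near-duplicate pairs in the first crawl that are
    still near-duplicates in the later crawl (None if there were none)."""
    before = near_duplicate_pairs(first_crawl, threshold, k)
    if not before:
        return None
    after = near_duplicate_pairs(later_crawl, threshold, k)
    return len(before & after) / len(before)


# Toy data standing in for two weekly crawls of the same URLs.
week1 = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumps over the lazy cat",
    "c": "completely different text about web crawling policies",
}
week11 = dict(week1, c="this page was rewritten entirely since the first crawl")

print(near_duplicate_pairs(week1))
print(pair_stability(week1, week11))
```

Under these assumptions, a crawler could treat a pair whose stability stays high across crawls as a candidate for recrawling only one member, which is the policy the abstract suggests.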