Designing efficient sampling techniques to detect webpage updates

Authors:
Qingzhao Tan;Ziming Zhuang;Prasenjit Mitra;C. Lee Giles
Affiliations:
Penn State University;Penn State University;Penn State University;Penn State University
Venue:
Proceedings of the 16th international conference on World Wide Web
Year:
2007

Citing 3
Cited 4

A large-scale study of the evolution of web pages
WWW '03 Proceedings of the 12th international conference on World Wide Web

Effective page refresh policies for Web crawlers
ACM Transactions on Database Systems (TODS)

Effective change detection using sampling
VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases

ChemXSeer: a digital library and data repository for chemical kinetics
Proceedings of the ACM first workshop on CyberInfrastructure: information management in eScience

Topical web crawling using weighted anchor text and web page change detection techniques
WSEAS Transactions on Information Science and Applications

Automatic retrieval of similar content using search engine query interface
Proceedings of the 18th ACM conference on Information and knowledge management

Development of an intelligent distributed news retrieval system
International Journal of Knowledge-based and Intelligent Engineering Systems

Quantified Score

Hi-index	0.00

Abstract

Due to resource constraints, Web archiving systems and search engines usually have difficulty keeping their entire local repository synchronized with the Web. We advance the state of the art in sampling-based synchronization techniques by answering a challenging question: given a sampled webpage and its change status, which other webpages are also likely to change? We present a study of various downloading granularities and policies, and propose an adaptive model based on the update history and the popularity of the webpages. We run extensive experiments on a large dataset of approximately 300,000 webpages, which show that more updated webpages are most likely to be found in the current or upper directories of the changed samples. Moreover, the adaptive strategies outperform the non-adaptive one in detecting important changes.
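
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the directory-locality idea it reports: once a sampled page is found to have changed, pages in the same directory are checked first, then pages one directory up. The function names (`directory_of`, `parent_directory`, `candidates_after_change`) and the example URLs are hypothetical, and the adaptive model based on update history and popularity is not modeled here.

```python
from collections import defaultdict
from posixpath import dirname
from typing import Dict, List, Optional
from urllib.parse import urlsplit


def directory_of(url: str) -> str:
    """Hypothetical grouping key: host plus the directory part of the URL path."""
    parts = urlsplit(url)
    path = parts.path or "/"
    # dirname("/a/b/p.html") is "/a/b"; dirname("/index.html") is "/"
    return parts.netloc + dirname(path)


def parent_directory(directory: str) -> Optional[str]:
    """Return the directory one level up, or None at the site root."""
    host, sep, path = directory.partition("/")
    if not sep or path == "":
        return None
    return host + dirname("/" + path)


def candidates_after_change(changed_url: str, repository: List[str]) -> List[str]:
    """Order pages to re-download after a sampled page is seen to change.

    Assumption for illustration: same-directory pages come first,
    then pages in the directory one level up; everything else is skipped.
    """
    by_dir: Dict[str, List[str]] = defaultdict(list)
    for url in repository:
        by_dir[directory_of(url)].append(url)

    current = directory_of(changed_url)
    ordered = [u for u in by_dir.get(current, []) if u != changed_url]

    upper = parent_directory(current)
    if upper is not None:
        ordered += by_dir.get(upper, [])
    return ordered


if __name__ == "__main__":
    repo = [
        "http://example.org/a/b/p1.html",
        "http://example.org/a/b/p2.html",
        "http://example.org/a/index.html",
        "http://example.org/c/other.html",
    ]
    for url in candidates_after_change("http://example.org/a/b/p1.html", repo):
        print(url)
```

A real crawler would presumably combine this ordering with a download budget and a per-page change estimate; those parts are left out because the abstract gives no details on them.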