Efficiently detecting webpage updates using samples

Authors:
Qingzhao Tan;Ziming Zhuang;Prasenjit Mitra;C. Lee Giles
Affiliations:
Computer Science and Engineering, The Pennsylvania State University, University Park, PA;Information Sciences and Technology, The Pennsylvania State University, University Park, PA;Computer Science and Engineering and Information Sciences and Technology, The Pennsylvania State University, University Park, PA;Computer Science and Engineering and Information Sciences and Technology, The Pennsylvania State University, University Park, PA
Venue:
ICWE'07 Proceedings of the 7th international conference on Web engineering
Year:
2007


Abstract

Due to resource constraints, Web archiving systems and search engines usually have difficulty keeping their local repositories completely synchronized with the Web. To address this problem, sampling-based techniques periodically poll a subset of the webpages in the local repository to detect changes on the Web, and update the local copies accordingly. The goal of such an approach is to discover as many changed webpages as possible within the limits of the available resources. In this paper we advance the state of the art in sampling-based techniques by answering a challenging question: given a sampled webpage that has been updated, which other webpages are also likely to have changed? We propose a set of sampling policies with various downloading granularities, taking into account the link structure, the directory structure, and content-based features. We also use the update history and the popularity of the webpages to adaptively model the download probability. We ran extensive experiments on a real Web data set of about 300,000 distinct URLs distributed among 210 websites. The results show that our sampling-based algorithm can detect about three times as many changed webpages as the baseline algorithm. They also show that changed webpages are most likely to be found in the same directory as the changed sample and in its upper directories. By applying a clustering algorithm to all the webpages, pages with similar change patterns are grouped together, so that updated webpages can be found in the same cluster as the changed sample. Moreover, our adaptive downloading strategies significantly outperform the static ones in detecting changes for popular webpages.
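
The directory-based idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: it assumes one randomly chosen sample per directory, a fixed polling budget, and a caller-supplied `has_changed` check, and it ignores the upper directories, link structure, content-based clustering, and adaptive download probabilities that the paper also considers. The names `directory_of`, `detect_changes`, `budget`, and `seed` are hypothetical and chosen only for this example.

```python
import posixpath
import random
from collections import defaultdict
from urllib.parse import urlsplit


def directory_of(url):
    """Return host plus directory path, e.g. 'a.example/news'."""
    parts = urlsplit(url)
    return parts.netloc + posixpath.dirname(parts.path)


def detect_changes(urls, has_changed, budget, seed=0):
    """Poll at most `budget` pages and return the changed ones found.

    Assumption for this sketch: one page is sampled per directory, and
    the rest of a directory is polled only if its sample has changed.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for url in urls:
        groups[directory_of(url)].append(url)

    changed, polls = set(), 0
    for directory in sorted(groups):
        if polls >= budget:
            break
        members = groups[directory]
        sample = rng.choice(members)
        polls += 1
        if not has_changed(sample):
            continue  # unchanged sample: skip the rest of this directory
        changed.add(sample)
        for url in members:
            if polls >= budget:
                break
            if url != sample:
                polls += 1
                if has_changed(url):
                    changed.add(url)
    return changed
```

In this sketch, `has_changed` stands in for whatever comparison a crawler uses between the live page and the stored copy (for example, a checksum comparison). A directory whose sample is unchanged costs a single poll, which is where the savings over polling every page come from.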