Clustering-based incremental web crawling

Authors:
Qingzhao Tan;Prasenjit Mitra
Affiliations:
The Pennsylvania State University, University Park, PA;The Pennsylvania State University, University Park, PA
Venue:
ACM Transactions on Information Systems (TOIS)
Year:
2010

Abstract

When crawling resources (for example, the number of machines and the available crawl time) are limited, a crawler has to decide on an optimal order in which to crawl and recrawl Web pages. Ideally, crawlers should request only those Web pages that have changed since the last crawl; in practice, a crawler may not know whether a Web page has changed before downloading it. In this article, we identify features of Web pages that are correlated with their change frequency. We design a crawling algorithm that clusters Web pages based on features, obtained by examining past history, that correlate with their change frequencies. The crawler downloads a sample of Web pages from each cluster and, depending on whether a significant number of these pages have changed in the last crawl cycle, decides whether to recrawl the entire cluster. To evaluate the performance of our incremental crawler, we develop an evaluation framework that measures which crawling policy yields the best search results for the end user. We run experiments on a real Web data set of about 300,000 distinct URLs distributed among 210 Web sites. The results demonstrate that the clustering-based sampling algorithm effectively groups pages with similar change patterns, and that our clustering-based crawling algorithm outperforms existing algorithms in that it improves the quality of the user experience for those who query the search engine.
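
The sample-then-decide step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes the clusters have already been computed, and the names `clusters_to_recrawl`, `has_changed`, `sample_size`, and `change_threshold` are hypothetical. In particular, reading "a significant number" as a fixed fraction of the sample is an assumption made only for illustration.

```python
import random
from typing import Callable, Dict, List, Optional, Sequence


def clusters_to_recrawl(
    clusters: Dict[str, Sequence[str]],
    has_changed: Callable[[str], bool],
    sample_size: int = 5,
    change_threshold: float = 0.5,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Return ids of clusters whose sampled pages changed often enough.

    clusters         -- hypothetical mapping: cluster id -> URLs in that cluster
    has_changed      -- hypothetical callback; in a real crawler this would
                        download the page and compare it with the stored copy
    change_threshold -- assumed fraction of changed sample pages that
                        triggers a recrawl of the whole cluster
    """
    rng = rng or random.Random(0)  # fixed seed keeps the sketch reproducible
    selected = []
    for cluster_id, urls in clusters.items():
        if not urls:
            continue  # nothing to sample from an empty cluster
        k = min(sample_size, len(urls))
        sample = rng.sample(list(urls), k)  # sample without replacement
        changed = sum(1 for url in sample if has_changed(url))
        if changed / k >= change_threshold:
            selected.append(cluster_id)
    return selected


# Toy data: every "news" page changed, no "archive" page did.
clusters = {
    "news": ["n1", "n2", "n3", "n4"],
    "archive": ["a1", "a2", "a3", "a4"],
}
changed_pages = {"n1", "n2", "n3", "n4"}
print(clusters_to_recrawl(clusters, lambda u: u in changed_pages, sample_size=2))
# prints ['news']
```

Only the sampled pages are downloaded before the decision is made, so the cost of checking a cluster depends on the sample size and not on the cluster size.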