High-performance web crawling

  • Authors:
  • Marc Najork; Allan Heydon

  • Affiliations:
  • Compaq Computer Corporation Systems Research Center, Palo Alto, CA; Model N, Inc., South San Francisco, CA

  • Venue:
  • Handbook of Massive Data Sets
  • Year:
  • 2002

Abstract

High-performance web crawlers are an important component of many web services. For example, search services use web crawlers to populate their indices, comparison shopping engines use them to collect product and pricing information from online vendors, and the Internet Archive uses them to record a history of the Internet. The design of a high-performance crawler poses many challenges, both technical and social, primarily due to the large scale of the web. The web crawler must be able to download pages at a very high rate, yet it must not overwhelm any particular web server. Moreover, it must maintain data structures far too large to fit in main memory, yet it must be able to access and update them efficiently. This chapter describes our experience building and operating such a high-performance crawler.
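The politeness constraint described above (a high aggregate download rate, but no single server overwhelmed) is commonly met by partitioning the URL frontier by host and rate-limiting each host's queue independently. The sketch below is a minimal single-threaded illustration of that idea, not the authors' implementation; the class name PoliteFrontier, the fixed POLITENESS_DELAY, and the blocking next_url interface are all illustrative assumptions.

```python
import heapq
import time
from collections import defaultdict, deque

# Fixed politeness interval between requests to one host. This value is an
# assumption chosen for illustration; real crawlers often scale the delay
# with how long the previous fetch from that host took.
POLITENESS_DELAY = 1.0  # seconds

class PoliteFrontier:
    """URL frontier that spaces out successive requests to any single host."""

    def __init__(self):
        self.per_host = defaultdict(deque)  # host -> queue of pending URLs
        self.heap = []                      # (earliest allowed fetch time, host)
        self.in_heap = set()                # hosts currently scheduled in the heap
        self.last_fetch = {}                # host -> monotonic time of last fetch

    def add(self, host, url):
        """Enqueue a URL; schedule its host if it is not already waiting."""
        self.per_host[host].append(url)
        if host not in self.in_heap:
            ready_at = self.last_fetch.get(host, 0.0) + POLITENESS_DELAY
            heapq.heappush(self.heap, (ready_at, host))
            self.in_heap.add(host)

    def next_url(self):
        """Return the next (host, url) pair, sleeping until it is polite to fetch."""
        if not self.heap:
            return None
        ready_at, host = heapq.heappop(self.heap)
        self.in_heap.discard(host)
        delay = ready_at - time.monotonic()
        if delay > 0:
            time.sleep(delay)
        url = self.per_host[host].popleft()
        self.last_fetch[host] = time.monotonic()
        if self.per_host[host]:
            # More URLs remain for this host: reschedule it after the delay.
            heapq.heappush(self.heap, (self.last_fetch[host] + POLITENESS_DELAY, host))
            self.in_heap.add(host)
        return host, url

if __name__ == "__main__":
    frontier = PoliteFrontier()
    frontier.add("example.com", "http://example.com/")
    frontier.add("example.com", "http://example.com/about")
    frontier.add("example.org", "http://example.org/")
    # Hosts interleave; the second example.com URL waits ~1 s after the first.
    for _ in range(3):
        print(frontier.next_url())
```

The authors' production crawler is multi-threaded and must also keep structures such as the set of already-seen URLs on disk, since they exceed main memory; the sketch isolates only the per-host rate-limiting idea.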