High-performance priority queues for parallel crawlers

Authors:
Mauricio Marin;Rodrigo Paredes;Carolina Bonacic
Affiliations:
Yahoo! Research Latin America, Santiago, Chile;Yahoo! Research Latin America, Santiago, Chile;Complutense University of Madrid, Madrid, Spain
Venue:
Proceedings of the 10th ACM workshop on Web information and data management
Year:
2008

Abstract

Large-scale data centers for crawlers can maintain a very large number of active HTTP connections in order to download, as fast as possible, the usually huge number of web pages in given sections of the WWW. This generates a continuous stream of new URLs of documents to be downloaded, and the associated workload can only be served efficiently with proper parallel computing techniques. The incoming URLs have to be organized by a priority measure so that the most relevant documents are downloaded first. This paper addresses the efficient management of these URLs, along with related synchronization issues such as those arising from URLs downloaded by the different processing nodes that form a cluster of computers. We propose efficient and scalable strategies that combine intra-node multi-core multi-threading with an inter-node distributed-memory environment, including efficient use of secondary memory.
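
To make the idea of ordering incoming URLs by a priority measure concrete, the following is a minimal single-node sketch in Python. It is not the authors' data structure: the class name `UrlFrontier`, the lock-protected binary heap, the duplicate filter, and the example URLs and priority values are all assumptions made for illustration. It does not cover the inter-node or secondary-memory aspects mentioned in the abstract.

```python
import heapq
import itertools
import threading


class UrlFrontier:
    """Hypothetical thread-safe URL priority queue (not the paper's design)."""

    def __init__(self):
        self._heap = []                # entries: (-priority, seq, url)
        self._seen = set()             # URLs that were already enqueued
        self._seq = itertools.count()  # tie-breaker: earlier insert wins
        self._lock = threading.Lock()  # one lock shared by all worker threads

    def push(self, url, priority):
        """Enqueue url unless it was seen before; return True if added."""
        with self._lock:
            if url in self._seen:
                return False
            self._seen.add(url)
            # heapq is a min-heap, so negate priority to pop the largest first.
            heapq.heappush(self._heap, (-priority, next(self._seq), url))
            return True

    def pop(self):
        """Return the highest-priority URL, or None if the queue is empty."""
        with self._lock:
            if not self._heap:
                return None
            return heapq.heappop(self._heap)[2]


# Usage: priorities here are made-up scores; higher means more relevant.
frontier = UrlFrontier()
frontier.push("http://example.org/a", 0.2)
frontier.push("http://example.org/b", 0.9)
frontier.push("http://example.org/a", 0.5)  # duplicate URL, ignored
first = frontier.pop()
```

A single global lock keeps the sketch short but serializes all threads; a design aimed at multi-core scalability, as the abstract describes, would presumably need to reduce that contention.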