High-performance priority queues for parallel crawlers

  • Authors: Mauricio Marin; Rodrigo Paredes; Carolina Bonacic

  • Affiliations: Yahoo! Research Latin America, Santiago, Chile; Yahoo! Research Latin America, Santiago, Chile; Complutense University of Madrid, Madrid, Spain

  • Venue: Proceedings of the 10th ACM workshop on Web information and data management
  • Year: 2008

Abstract

Large-scale data centers for crawlers can maintain a very large number of active HTTP connections in order to download, as fast as possible, the usually huge number of web pages from given sections of the WWW. This generates a continuous stream of new URLs of documents to be downloaded, and the associated workload can only be served efficiently with proper parallel computing techniques. The incoming URLs have to be organized by a priority measure so that the most relevant documents are downloaded first. This paper addresses how to manage these URLs efficiently, together with synchronization issues such as URLs being downloaded by the different processing nodes that form a cluster of computers. We propose efficient and scalable strategies that consider intra-node multi-core multi-threading in an inter-node distributed-memory environment, including efficient use of secondary memory.
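
The abstract's core idea, ordering incoming URLs by a priority measure while several threads share the queue, can be illustrated with a minimal sketch. This is not the authors' strategy: the class name `UrlFrontier`, the single coarse lock, the in-memory binary heap, and the numeric priority values are all assumptions made for illustration, and the sketch ignores the distributed-memory and secondary-memory aspects the paper addresses.

```python
import heapq
import itertools
import threading


class UrlFrontier:
    """Thread-safe max-priority queue of URLs (hypothetical illustration)."""

    def __init__(self):
        self._heap = []                # entries: (-priority, seq, url)
        self._seen = set()             # URLs that were already enqueued once
        self._seq = itertools.count()  # tie-breaker: FIFO among equal priorities
        self._lock = threading.Lock()  # one coarse lock shared by all threads

    def push(self, url, priority):
        """Enqueue url unless it was seen before; return True if enqueued."""
        with self._lock:
            if url in self._seen:
                return False
            self._seen.add(url)
            # heapq is a min-heap, so negate the priority to pop the largest first.
            heapq.heappush(self._heap, (-priority, next(self._seq), url))
            return True

    def pop(self):
        """Return the highest-priority URL, or None if the frontier is empty."""
        with self._lock:
            if not self._heap:
                return None
            _, _, url = heapq.heappop(self._heap)
            return url
```

A single lock serializes all threads on one queue, which is the kind of contention that scalable designs aim to reduce; the sketch only shows the ordering and duplicate-suppression behavior.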