High performance crawling system

  • Authors:
  • Younès Hafri; Chabane Djeraba

  • Affiliations:
  • Ecole Polytechnique de Nantes, Cédex, France; UMR CNRS, Cédex, France

  • Venue:
  • Proceedings of the 6th ACM SIGMM international workshop on Multimedia information retrieval
  • Year:
  • 2004

Abstract

This paper describes the design and implementation of a real-time distributed Web crawling system running on a cluster of machines. The system crawls several thousand pages per second, includes a high-performance fault manager, is platform independent, and adapts transparently to a wide range of configurations without incurring additional hardware expenditure. We then provide details of the system architecture and describe the technical choices made to achieve very high performance crawling. Finally, we discuss the experimental results and compare them with those of other documented systems.