A scalable, extensible web crawler based on P2P overlay networks

  • Authors:
  • P. Mittal; A. Dixit; A. K. Sharma

  • Affiliations:
  • YMCAIE, Faridabad, Haryana, India; YMCAIE, Faridabad, Haryana, India; YMCAIE, Faridabad, Haryana, India

  • Venue:
  • Proceedings of the International Conference and Workshop on Emerging Trends in Technology
  • Year:
  • 2010

Abstract

The World Wide Web is an interlinked collection of billions of documents. Ironically, the very size of this collection has become an obstacle to information retrieval: a user must sift through scores of pages to find the information he or she desires. Web crawlers are the heart of search engines. Mercator is a scalable web crawler that supports extensibility and customizability. This paper explores the challenges and issues that arise from Mercator's use of a single FIFO queue as its URL frontier. In Mercator, the URL frontier is traversed every time a new URL arrives, and because Mercator runs in a multithreaded environment, a single FIFO causes many problems. This paper also explores the use of Pastry in the URL frontier implementation, since a single URL frontier is a constraint in a multithreaded environment. Peer-to-peer overlay networks provide a locality property that improves application performance and reduces network usage. Mercator's URL frontier is therefore extended with peer-to-peer overlay routing logic to find the best-matching canonical form for a newly arrived URL, and an algorithm is designed that finds this match with a minimum number of comparisons. This work also explores the idea of giving each working thread its own FIFO sub-queue and a unique canonical URL form, with URLs grouped onto threads by a hash function; an algorithm is designed around this eager, hashing-based strategy so that searching is optimized. The proposed system's design features a crawler core that handles the main crawling tasks, with extensibility provided through protocol and processing modules. The greatest positive impact was a pronounced improvement in the performance of the system compared to the existing process.
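The frontier partitioning the abstract describes can be sketched roughly as follows: each working thread owns one FIFO sub-queue, and a hash of a URL's canonical form (here, assumed to be its lowercased host) selects the sub-queue in a single step rather than by scanning a shared FIFO. This is a minimal illustrative sketch, not the paper's implementation; the class and method names are invented for the example.

```python
import hashlib
from collections import deque
from urllib.parse import urlsplit

class PartitionedFrontier:
    """Sketch of a URL frontier split into one FIFO sub-queue per
    working thread. Names are illustrative, not from the paper."""

    def __init__(self, num_threads):
        # One FIFO sub-queue per working thread.
        self.queues = [deque() for _ in range(num_threads)]

    def _canonical(self, url):
        # Assumed canonical form: the lowercased host, so all URLs
        # from one host map to the same sub-queue (a locality property
        # in the spirit of P2P overlay routing).
        return urlsplit(url).netloc.lower()

    def enqueue(self, url):
        # A stable hash of the canonical form picks the sub-queue
        # directly, avoiding a traversal of a single shared FIFO.
        digest = hashlib.md5(self._canonical(url).encode()).digest()
        idx = int.from_bytes(digest[:4], "big") % len(self.queues)
        self.queues[idx].append(url)
        return idx

    def dequeue(self, thread_id):
        # Each thread drains only its own sub-queue, so no locking of
        # a global frontier is needed in this sketch.
        q = self.queues[thread_id]
        return q.popleft() if q else None
```

Under this scheme, matching a new URL to its group costs one hash computation instead of a comparison against every queued URL, which is the "minimum comparisons" property the abstract aims for.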