Mining the web with hierarchical crawlers – a resource sharing based crawling approach

  • Authors:
  • Anirban Kundu;Ruma Dutta;Rana Dattagupta;Debajyoti Mukhopadhyay

  • Affiliations:
  • Netaji Subhash Engineering College, West Bengal University of Technology, West Bengal-700 152, India / Web Intelligence & Distributed Computing Research Lab (WIDiCoReL), Green Tow ...; Netaji Subhash Engineering College, West Bengal University of Technology, West Bengal-700 152, India / Web Intelligence & Distributed Computing Research Lab (WIDiCoReL), Green Tow ...; Jadavpur University, West Bengal-700 032, India; Calcutta Business School, Diamond Harbour Road, Bishnupur, West Bengal-743 503, India / Web Intelligence & Distributed Computing Research Lab (WIDiCoReL), Green Tower C-9 ...

  • Venue:
  • International Journal of Intelligent Information and Database Systems
  • Year:
  • 2009


Abstract

An important component of any web search engine is its crawler, also known as a robot or spider. An efficient set of crawlers makes a search engine more powerful, apart from its other measures of performance such as its ranking algorithm, storage mechanism, indexing techniques, etc. In this paper, we propose an extended technique for crawling the World Wide Web (WWW) on behalf of a search engine. The approach combines multiple crawlers working in parallel with the mechanism of focused crawling (Chakrabarti et al., 1999a, 2002; Mukhopadhyay et al., 2006). In this approach, the total structure of a website is divided into several levels based on its hyperlink structure for downloading web pages from that website (Chakrabarti et al., 1999b; Mukhopadhyay and Singh, 2004). The number of crawlers at each level is not fixed but dynamic: it is determined at execution time, on demand, by a threaded program based on the number of hyperlinks on a specific web page. This paper also proposes a focused hierarchical crawling technique in which crawlers are created dynamically at runtime for different domains to crawl web pages with the essence of resource sharing.
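The abstract's core idea — pages grouped into levels by hyperlink depth, with worker threads spawned at runtime in proportion to the hyperlinks found on each page — can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the toy in-memory link graph, the function names, and the one-thread-per-hyperlink policy are all assumptions made to keep the example self-contained and runnable without network access.

```python
# Hedged sketch of level-based crawling with dynamically created worker threads.
# LINK_GRAPH is a hypothetical stand-in for real web pages and their hyperlinks.
import threading
from collections import defaultdict

LINK_GRAPH = {
    "site/index": ["site/a", "site/b"],
    "site/a": ["site/a1", "site/a2"],
    "site/b": ["site/b1"],
    "site/a1": [], "site/a2": [], "site/b1": [],
}

def crawl(url, level, results, lock):
    """Record the page at its depth level, then spawn one crawler thread
    per outgoing hyperlink, so the crawler count is decided at runtime."""
    with lock:
        results[level].append(url)
    threads = [
        threading.Thread(target=crawl, args=(link, level + 1, results, lock))
        for link in LINK_GRAPH.get(url, [])
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

def hierarchical_crawl(seed):
    """Return pages grouped by hyperlink-depth level from the seed page."""
    results = defaultdict(list)
    crawl(seed, 0, results, threading.Lock())
    return dict(results)

levels = hierarchical_crawl("site/index")
```

A real deployment would of course fetch and parse live pages, bound the number of threads, and add the focused-crawling relevance check the paper describes; the sketch only illustrates how the per-level crawler count can emerge from the link structure at execution time.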