Mining the web with hierarchical crawlers – a resource sharing based crawling approach

  • Authors:
  • Anirban Kundu;Ruma Dutta;Rana Dattagupta;Debajyoti Mukhopadhyay

  • Affiliations:
  • Netaji Subhash Engineering College, West Bengal University of Technology, West Bengal-700 152, India / Web Intelligence & Distributed Computing Research Lab (WIDiCoReL), Green Tow ...; Netaji Subhash Engineering College, West Bengal University of Technology, West Bengal-700 152, India / Web Intelligence & Distributed Computing Research Lab (WIDiCoReL), Green Tow ...; Jadavpur University, West Bengal-700 032, India; Calcutta Business School, Diamond Harbour Road, Bishnupur, West Bengal-743 503, India / Web Intelligence & Distributed Computing Research Lab (WIDiCoReL), Green Tower C-9 ...

  • Venue:
  • International Journal of Intelligent Information and Database Systems
  • Year:
  • 2009


Abstract

An important component of any web search engine is its crawler, also known as a robot or spider. An efficient set of crawlers makes a search engine more powerful, apart from its other measures of performance such as its ranking algorithm, storage mechanism, indexing techniques, etc. In this paper, we propose an extended technique for crawling the World Wide Web (WWW) on behalf of a search engine. The approach combines multiple crawlers working in parallel with the mechanism of focused crawling (Chakrabarti et al., 1999a, 2002; Mukhopadhyay et al., 2006). In this approach, the total structure of a website is divided into several levels based on its hyperlink structure for downloading web pages from that website (Chakrabarti et al., 1999b; Mukhopadhyay and Singh, 2004). The number of crawlers at each level is not fixed but dynamic: it is determined at execution time, on demand, by a threaded program based on the number of hyperlinks on a specific web page. This paper also proposes a focused hierarchical crawling technique in which crawlers are created dynamically at runtime for different domains to crawl web pages with the essence of resource sharing.
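The abstract's core idea — pages grouped into levels by hyperlink depth, with worker threads spawned at runtime in proportion to the hyperlinks found on each page — can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the toy in-memory link graph, the function names, and the one-thread-per-hyperlink policy are all assumptions made to keep the example self-contained and runnable without network access.

```python
# Hedged sketch of level-based crawling with dynamically created worker threads.
# LINK_GRAPH is a hypothetical stand-in for real web pages and their hyperlinks.
import threading
from collections import defaultdict

LINK_GRAPH = {
    "site/index": ["site/a", "site/b"],
    "site/a": ["site/a1", "site/a2"],
    "site/b": ["site/b1"],
    "site/a1": [], "site/a2": [], "site/b1": [],
}

def crawl(url, level, results, lock):
    """Record the page at its depth level, then spawn one crawler thread
    per outgoing hyperlink, so the crawler count is decided at runtime."""
    with lock:
        results[level].append(url)
    threads = [
        threading.Thread(target=crawl, args=(link, level + 1, results, lock))
        for link in LINK_GRAPH.get(url, [])
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

def hierarchical_crawl(seed):
    """Return pages grouped by hyperlink-depth level from the seed page."""
    results = defaultdict(list)
    crawl(seed, 0, results, threading.Lock())
    return dict(results)

levels = hierarchical_crawl("site/index")
```

A real deployment would of course fetch and parse live pages, bound the number of threads, and add the focused-crawling relevance check the paper describes; the sketch only illustrates how the per-level crawler count can emerge from the link structure at execution time.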