Accurate and efficient crawling for relevant websites

  • Authors:
  • Martin Ester;Hans-Peter Kriegel;Matthias Schubert

  • Affiliations:
  • Simon Fraser University, School of Computing Science, Burnaby BC, Canada;University of Munich, Institute for Computer Science, Munich, Germany;University of Munich, Institute for Computer Science, Munich, Germany

  • Venue:
  • VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
  • Year:
  • 2004

Quantified Score

Hi-index 0.00

Visualization

Abstract

Focused web crawlers have recently emerged as an alternative to the well-established web search engines. While the well-known focused crawlers retrieve relevant webpages, there are various applications which target whole websites instead of single webpages. For example, companies are represented by websites, not by individual webpages. To answer queries targeted at websites, web directories are an established solution. In this paper, we introduce a novel focused website crawler to employ the paradigm of focused crawling for the search of relevant websites. The proposed crawler is based on a two-level architecture and corresponding crawl strategies with an explicit concept of websites. The external crawler views the web as a graph of linked websites, selects the websites to be examined next and invokes internal crawlers. Each internal crawler views the webpages of a single given website and performs focused (page) crawling within that website. Our experimental evaluation demonstrates that the proposed focused website crawler clearly outperforms previous methods of focused crawling which were adapted to retrieve websites instead of single webpages.