Adaptive geospatially focused crawling

  • Authors:
  • Dirk Ahlers;Susanne Boll

  • Affiliations:
  • OFFIS -- Institute for Information Technology, Oldenburg, Germany;University of Oldenburg, Oldenburg, Germany

  • Venue:
  • Proceedings of the 18th ACM conference on Information and knowledge management
  • Year:
  • 2009

Quantified Score

Hi-index 0.00

Visualization

Abstract

Location information on the Web is a precious asset for a multitude of applications and is becoming an increasingly important dimension in Web search. Even though more and more Web pages carry location information, they form only a small share of all pages and are scattered over the Web. To efficiently find and index location-related Web content, we propose an efficient crawling strategy that retrieves precisely those pages that are geospatially relevant while minimizing the amount of the non-spatially-relevant pages within the crawled pages. We propose to address this challenge by expanding the technique of focused crawling to exploit location references on Web pages to specifically retrieve geospatial topics on the Web. In this paper, we describe the design and development of a focused crawler with an adaptive geospatial focus that efficiently retrieves and identifies location-relevant documents on the Web. Drawing from geospatial features of both Web pages and the link graph, a crawl strategy based on Bayesian classifiers prioritizes promising links and pages, leading to a faster coverage of the desired geospatial topic as a means for fast creation of precise geospatial Web indexes. We present evaluations of the system's performance and share our findings on the geospatial Web graph and the distribution of location references on the Web.