Structure-driven crawler generation by example

  • Authors:
  • Márcio L. A. Vidal;Altigran S. da Silva;Edleno S. de Moura;João M. B. Cavalcanti

  • Affiliations:
  • Universidade Federal do Amazonas, Manaus -- Brazil;Universidade Federal do Amazonas, Manaus -- Brazil;Universidade Federal do Amazonas, Manaus -- Brazil;Universidade Federal do Amazonas, Manaus -- Brazil

  • Venue:
  • SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
  • Year:
  • 2006

Abstract

Many Web IR and Digital Library applications require a crawling process to collect pages, with the ultimate goal of taking advantage of useful information available on Web sites. For some of these applications, the criteria that determine whether a page belongs in a collection are related to the page content. However, there are situations in which the inner structure of the pages provides better criteria to guide the crawling process than their content. In this paper, we present a structure-driven approach for generating Web crawlers that requires minimal effort from users. The idea is to take as input a sample page and an entry point to a Web site and generate a structure-driven crawler based on navigation patterns, that is, sequences of patterns for the links a crawler has to follow to reach pages structurally similar to the sample page. In the experiments we have carried out, structure-driven crawlers generated by our new approach were able to collect all pages that match the samples given, including pages added after the crawlers were generated.
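As a rough illustration of how a navigation pattern can guide a crawler, the sketch below follows a sequence of link patterns from an entry point, one pattern per navigation level. It is not the authors' implementation: the abstract does not say how patterns are represented or learned, so the sketch assumes a hand-written list of URL regular expressions and a hypothetical in-memory site (`SITE`, `PATTERN`, and `crawl` are illustrative names), whereas the paper derives its patterns from a sample page and from page structure.

```python
import re
from typing import Dict, List

# Hypothetical in-memory "site": each URL maps to the links found on that page.
SITE: Dict[str, List[str]] = {
    "/": ["/artists/a", "/artists/b", "/about"],
    "/artists/a": ["/albums/1", "/albums/2", "/"],
    "/artists/b": ["/albums/3", "/about"],
    "/about": ["/"],
    "/albums/1": [],
    "/albums/2": [],
    "/albums/3": [],
}

# Assumed navigation pattern: one link pattern per level, from entry point to targets.
PATTERN: List[str] = [r"/artists/\w+", r"/albums/\d+"]


def crawl(entry: str, navigation_pattern: List[str], site: Dict[str, List[str]]) -> List[str]:
    """Follow only the links that match the pattern for the current level."""
    frontier = [entry]
    for link_pattern in navigation_pattern:
        regex = re.compile(link_pattern)
        next_frontier: List[str] = []
        seen = set()
        for page in frontier:
            for link in site.get(page, []):
                # Skip links that do not match this level, and duplicates.
                if regex.fullmatch(link) and link not in seen:
                    seen.add(link)
                    next_frontier.append(link)
        frontier = next_frontier
    # Pages reached after the last level are the collected target pages.
    return frontier


targets = crawl("/", PATTERN, SITE)
print(targets)
```

Because the crawler stores patterns rather than a fixed list of URLs, a page added to the site later is still reached on the next run, provided its link matches the pattern for its level.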