Exploiting genre in focused crawling

  • Authors:
  • Guilherme T. De Assis; Alberto H. F. Laender; Marcos André Gonçalves; Altigran S. Da Silva

  • Affiliations:
  • Computer Science Department, Federal University of Minas Gerais, Belo Horizonte, MG, Brazil;Computer Science Department, Federal University of Minas Gerais, Belo Horizonte, MG, Brazil;Computer Science Department, Federal University of Minas Gerais, Belo Horizonte, MG, Brazil;Computer Science Department, Federal University of Amazonas, Manaus, AM, Brazil

  • Venue:
  • SPIRE'07 Proceedings of the 14th international conference on String processing and information retrieval
  • Year:
  • 2007

Abstract

In this paper, we propose a novel approach to focused crawling that exploits genre and content-related information present in Web pages to guide the crawling process. The effectiveness, efficiency and scalability of this approach are demonstrated by a set of experiments involving the crawling of pages related to syllabi (genre) of computer science courses (content). The results of these experiments show that focused crawlers constructed according to our approach achieve F1 levels above 92% (an average gain of 178% over traditional focused crawlers), while needing to analyze no more than 60% of the visited pages to find 90% of the relevant pages (an average gain of 82% over traditional focused crawlers).
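
The abstract reports its results as F1 levels and as relative gains over a baseline. As a reading aid, the arithmetic behind those two kinds of figures can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function names `f1_score` and `relative_gain` are hypothetical, and the sample numbers are made up purely to show the calculation.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return F1, the harmonic mean of precision and recall (both in [0, 1])."""
    if precision + recall == 0:
        return 0.0  # convention: F1 is 0 when both precision and recall are 0
    return 2 * precision * recall / (precision + recall)


def relative_gain(new: float, baseline: float) -> float:
    """Return the improvement of `new` over `baseline` as a percentage."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return 100.0 * (new - baseline) / baseline


if __name__ == "__main__":
    # Hypothetical values, chosen only to illustrate the arithmetic;
    # they are not taken from the paper.
    precision, recall = 0.90, 0.95
    print(f"F1 = {f1_score(precision, recall):.4f}")
    print(f"gain = {relative_gain(0.50, 0.25):.1f}%")
```

Under this definition, a gain above 100% means the new value is more than double the baseline.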