The impact of term selection in genre-aware focused crawling

  • Authors:
  • Guilherme T. de Assis;Alberto H. F. Laender;Altigran S. da Silva;Marcos André Gonçalves

  • Affiliations:
  • Federal University of Minas Gerais, Belo Horizonte MG Brazil;Federal University of Minas Gerais, Belo Horizonte MG Brazil;Federal University of Amazonas, Manaus AM Brazil;Federal University of Minas Gerais, Belo Horizonte MG Brazil

  • Venue:
  • Proceedings of the 2008 ACM symposium on Applied computing
  • Year:
  • 2008

Abstract

The genre-aware approach to focused crawling aims at crawling pages related to specific topics that can be expressed in terms of both genre and content information. Such an approach requires an expert to specify sets of terms that describe the genre and the content of the pages of interest. In this paper, we analyze the impact of term selection on this approach. To this end, we performed an experimental study in which we varied the number of genre and content terms used in focused crawling processes aimed at two kinds of pages: syllabi (genre) of computer science courses (subject) and sale offers (genre) of computer equipment (subject). The study showed that a small set of terms selected by an expert is usually enough to produce good results. In addition, we propose and experimentally evaluate a strategy for the semi-automatic generation of the terms used in such an approach. The results of these experiments showed that the strategy is very effective and provides a means to assist an expert in the task of specifying the required sets of terms.
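
To make the idea of separate genre and content term sets concrete, the following is a minimal illustrative sketch in Python. It is not the method evaluated in the paper: the term-overlap score, the 0.5 threshold, the function names, and the sample term lists are all assumptions chosen only to show how a page could be required to match both a genre and a subject.

```python
import re

def term_score(text, terms):
    """Fraction of the given terms that occur as whole tokens in the text.

    Hypothetical scoring rule for illustration; the paper may use a
    different similarity measure.
    """
    if not terms:
        return 0.0
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    hits = sum(1 for term in terms if term.lower() in tokens)
    return hits / len(terms)

def is_relevant(text, genre_terms, content_terms, threshold=0.5):
    """A page counts as relevant only if it matches BOTH term sets.

    The threshold value is an assumption, not taken from the paper.
    """
    genre = term_score(text, genre_terms)
    content = term_score(text, content_terms)
    return genre >= threshold and content >= threshold

# Hypothetical expert-chosen terms for "syllabi of computer science courses".
genre_terms = ["syllabus", "prerequisites", "grading", "textbook"]
content_terms = ["algorithms", "databases", "programming", "compilers"]

syllabus_page = ("Course syllabus: grading policy, textbook list. "
                 "Topics: algorithms and programming.")
shop_page = "Buy books on algorithms and programming at a discount."

syllabus_ok = is_relevant(syllabus_page, genre_terms, content_terms)
shop_ok = is_relevant(shop_page, genre_terms, content_terms)
```

In this sketch the second page matches the subject but not the genre, so it is rejected. Varying the length of `genre_terms` and `content_terms` is a rough analogue of the term-selection experiments the abstract describes.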