Automatic generation of agents for collecting hidden web pages for data extraction

  • Authors:
  • Juliano Palmieri Lage; Altigran S. da Silva; Paulo B. Golgher; Alberto H. F. Laender

  • Affiliations:
  • Departamento de Ciência da Computação, Universidade Federal de Minas Gerais, 31270-901, Belo Horizonte, MG, Brazil; Departamento de Ciência da Computação, Universidade Federal do Amazonas, 69077-00, Manaus, AM, Brazil; Akwan Information Technologies, Av. Antônio Abraão Caram, 430-4o. Andar, 31275-000, Belo Horizonte, MG, Brazil; Departamento de Ciência da Computação, Universidade Federal de Minas Gerais, 31270-901, Belo Horizonte, MG, Brazil

  • Venue:
  • Data & Knowledge Engineering - Special issue: WIDM 2002
  • Year:
  • 2004

Abstract

As the Web grows, more and more data has become available under dynamic forms of publication, such as legacy databases accessed through an HTML form (the so-called hidden Web). In such situations, integrating this data relies increasingly on the fast generation of agents that can automatically fetch pages for further processing. As a result, there is an increasing need for tools that can help users generate such agents. In this paper, we describe a method for automatically generating agents to collect hidden Web pages. This method uses a pre-existing data repository to identify the contents of these pages and takes advantage of patterns that can be found among Web sites to identify the navigation paths to follow. To demonstrate the accuracy of our method, we discuss the results of a number of experiments carried out with sites from different domains.