Collecting hidden Web pages for data extraction

  • Authors:
  • Juliano Palmieri Lage;Altigran S. da Silva;Paulo B. Golgher;Alberto H. F. Laender

  • Affiliations:
  • Federal University of Minas Gerais, Belo Horizonte MG Brazil;Federal University of Amazonas, Manaus AM Brazil;Akwan Information Technologies, Belo Horizonte MG Brazil;Federal University of Minas Gerais, Belo Horizonte MG Brazil

  • Venue:
  • Proceedings of the 4th international workshop on Web information and data management
  • Year:
  • 2002

Abstract

As the Web grows, more and more data has become available under dynamic forms of publication, such as a legacy database accessed through an HTML form (the so-called Hidden Web). In such situations, integrating this data relies increasingly on the fast generation of page-fetching agents. As a result, there is a growing need for tools that help users generate such agents. In this paper, we describe an approach to automatically generating agents that collect hidden Web pages. The approach uses a pre-existing data repository to identify the contents of these pages and takes advantage of regularities found among Web sites. To demonstrate the effectiveness of our approach, we discuss the results of a number of experiments carried out with sites from different domains. We also discuss how such regularities among sites can be formalized.