Collecting hidden Web pages for data extraction

Authors:
Juliano Palmieri Lage;Altigran S. da Silva;Paulo B. Golgher;Alberto H. F. Laender
Affiliations:
Federal University of Minas Gerais, Belo Horizonte MG Brazil;Federal University of Amazonas, Manaus AM Brazil;Akwan Information Technologies, Belo Horizonte MG Brazil;Federal University of Minas Gerais, Belo Horizonte MG Brazil
Venue:
Proceedings of the 4th international workshop on Web information and data management
Year:
2002

Abstract

As the Web grows, more and more data becomes available through dynamic forms of publication, such as a legacy database accessed through an HTML form (the so-called Hidden Web). In such situations, integrating this data relies increasingly on the fast generation of page-fetching agents. As a result, there is a growing need for tools that help users generate such agents. In this paper, we describe an approach to automatically generating agents that collect hidden Web pages. The approach uses a pre-existing data repository to identify the contents of these pages and takes advantage of regularities found among Web sites. To demonstrate the effectiveness of our approach, we discuss the results of a number of experiments carried out with sites from different domains. We also discuss how such regularities among sites can be formalized.
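
The abstract does not give implementation details, so the following is only a minimal sketch of one way a pre-existing data repository could be used to recognize the contents of a fetched page. It assumes that a page counts as a data page when enough known repository values occur in its text. The function names, the sample repository, the sample pages, and the `min_hits` threshold are all hypothetical and are not taken from the paper.

```python
import re


def normalize(text):
    """Lowercase and collapse whitespace so matching ignores page layout."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def repository_hits(page_text, repository):
    """Return the sorted repository values that occur in the page text."""
    page = normalize(page_text)
    return sorted(
        {
            value
            for record in repository
            for value in record.values()
            if normalize(value) in page
        }
    )


def looks_like_data_page(page_text, repository, min_hits=2):
    """Assumed heuristic: enough known values means the page carries data."""
    return len(repository_hits(page_text, repository)) >= min_hits


# Hypothetical sample repository and pages, used for illustration only.
REPOSITORY = [
    {"title": "Dom Casmurro", "author": "Machado de Assis"},
    {"title": "Grande Sertao: Veredas", "author": "Guimaraes Rosa"},
]
RESULT_PAGE = "<html><body><li>Dom  Casmurro, by Machado de Assis</li></body></html>"
FORM_PAGE = "<html><body><form>Search our catalog: <input name='q'></form></body></html>"

if __name__ == "__main__":
    print(repository_hits(RESULT_PAGE, REPOSITORY))
    print(looks_like_data_page(RESULT_PAGE, REPOSITORY))
    print(looks_like_data_page(FORM_PAGE, REPOSITORY))
```

In this sketch, a page-fetching agent would keep following a form or link only when the resulting page passes `looks_like_data_page`. Plain substring matching is chosen for brevity; a real agent would likely need more tolerant matching.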