Automatic generation of agents for collecting hidden web pages for data extraction

Authors:
Juliano Palmieri Lage;Altigran S. da Silva;Paulo B. Golgher;Alberto H. F. Laender
Affiliations:
Departamento de Ciência da Computação, Universidade Federal de Minas Gerais, 31270-901, Belo Horizonte, MG, Brazil;Departamento de Ciência da Computação, Universidade Federal do Amazonas, 69077-00, Manaus, AM, Brazil;Akwan Information Technologies, Av. Antônio Abraão Caram, 430-4o. Andar, 31275-000, Belo Horizonte, MG, Brazil;Departamento de Ciência da Computação, Universidade Federal de Minas Gerais, 31270-901, Belo Horizonte, MG, Brazil
Venue:
Data & Knowledge Engineering - Special issue: WIDM 2002
Year:
2004

Abstract

As the Web grows, more and more data is published dynamically, for example from legacy databases reachable only through an HTML form (the so-called hidden Web). Integrating this data increasingly depends on quickly generating agents that can automatically fetch the relevant pages for further processing, so there is a growing need for tools that help users generate such agents. In this paper, we describe a method for automatically generating agents to collect hidden Web pages. The method uses a pre-existing data repository to identify the contents of these pages and takes advantage of patterns common to Web sites to identify the navigation paths to follow. To demonstrate the accuracy of our method, we discuss the results of a number of experiments carried out with sites from different domains.
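
The abstract does not say how the repository is matched against fetched pages, so the following is only a minimal sketch of one plausible reading: a page is treated as a data page when enough known repository values occur in its visible text. The function names (`page_text`, `repository_hits`, `looks_like_data_page`) and the `min_hits` threshold are hypothetical, introduced for illustration, and are not taken from the paper.

```python
import re
from html.parser import HTMLParser
from typing import Iterable


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML page, ignoring the tags."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks = []

    def handle_data(self, data: str) -> None:
        self.chunks.append(data)


def page_text(html: str) -> str:
    """Return the page text, lowercased, with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.chunks)).strip().lower()


def repository_hits(html: str, repository: Iterable[str]) -> int:
    """Count how many repository values occur in the page text."""
    text = page_text(html)
    return sum(1 for value in repository if value.lower() in text)


def looks_like_data_page(html: str, repository: Iterable[str],
                         min_hits: int = 2) -> bool:
    """Assumed heuristic: enough known values means a data page."""
    return repository_hits(html, repository) >= min_hits


if __name__ == "__main__":
    # Hypothetical repository of values already known for the domain.
    repo = {"Modern Information Retrieval", "Wrapper induction"}
    page = "<ul><li>Modern Information Retrieval</li><li>Wrapper induction</li></ul>"
    print(looks_like_data_page(page, repo))
```

Under these assumptions, a generated agent could apply such a check to each page reached after submitting a form, keeping the navigation paths that lead to pages passing the check. Substring matching is the simplest choice here; a real system would likely need tolerance for formatting differences between the repository and the site.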