A Framework for Generating Attribute Extractors for Web Data Sources

Authors:
Davi de Castro Reis; Robson Braga Araújo; Altigran Soares da Silva; Berthier A. Ribeiro-Neto
Venue:
SPIRE 2002 Proceedings of the 9th International Symposium on String Processing and Information Retrieval
Year:
2002

Citing 10

A hierarchical approach to wrapper induction
Proceedings of the third annual conference on Autonomous Agents

Extracting semi-structured data through examples
Proceedings of the eighth international conference on Information and knowledge management

Data on the Web: from relations to semistructured data and XML

Conceptual-model-based data extraction from multiple-record Web pages
Data & Knowledge Engineering

Bootstrapping for example-based data extraction
Proceedings of the tenth international conference on Information and knowledge management

A brief survey of web data extraction tools
ACM SIGMOD Record

DEByE - Data extraction by example
Data & Knowledge Engineering

Hierarchical Wrapper Induction for Semistructured Information Sources
Autonomous Agents and Multi-Agent Systems

Visual Web Information Extraction with Lixto
Proceedings of the 27th International Conference on Very Large Data Bases

RoadRunner: Towards Automatic Data Extraction from Large Web Sites
Proceedings of the 27th International Conference on Very Large Data Bases

Cited 2

The Web-DL environment for building digital libraries from the Web
Proceedings of the 3rd ACM/IEEE-CS joint conference on Digital libraries

Structure-driven crawler generation by example
SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval

Abstract

To cope with the irregularities of typical semistructured Web data, extraction tools usually break the extraction task into two phases: an extraction phase, in which atomic attribute values are extracted from Web pages, and an assembling phase, in which these atomic values are grouped to form complex objects. As a consequence, the whole process depends heavily on the attribute values collected in the first phase: all attribute values of interest should be properly recognized, and spurious values should be discarded. Attribute value extraction is therefore an important problem. In this paper, we propose a new framework for generating attribute value extractors. The main appeal of this framework is that it can be adapted to deal with specific types of data sources and to incorporate distinct types of heuristics for achieving good extraction performance. To demonstrate the feasibility of this proposal, we present an implementation of the framework for data-rich Web pages and show how a number of simple heuristics, some of them presented in the recent literature, can be incorporated into it. We also present experimental results; in most cases, they are at least as good as results previously reported in the literature.
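
The two-phase organization described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' framework: the names `AttributeValue`, `regex_extractor`, `extract_phase`, and `assemble_phase`, the regular-expression heuristics, and the sample page are all hypothetical, chosen only to show atomic values being extracted first and grouped into objects afterwards.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class AttributeValue:
    attribute: str  # name of the attribute this value belongs to
    text: str       # the extracted atomic value
    position: int   # character offset of the value in the page


# An extractor maps page text to candidate values of a single attribute.
Extractor = Callable[[str], List[AttributeValue]]


def regex_extractor(attribute: str, pattern: str) -> Extractor:
    """Build an extractor from one simple heuristic: a regular expression."""
    compiled = re.compile(pattern)

    def extract(page: str) -> List[AttributeValue]:
        return [AttributeValue(attribute, m.group(0), m.start())
                for m in compiled.finditer(page)]

    return extract


def extract_phase(page: str, extractors: List[Extractor]) -> List[AttributeValue]:
    """Phase 1: collect atomic attribute values, ordered by position in the page."""
    values: List[AttributeValue] = []
    for extractor in extractors:
        values.extend(extractor(page))
    return sorted(values, key=lambda v: v.position)


def assemble_phase(values: List[AttributeValue],
                   first_attribute: str) -> List[Dict[str, str]]:
    """Phase 2: group values into objects, opening a new object at each
    occurrence of first_attribute (an assumed, deliberately naive rule)."""
    objects: List[Dict[str, str]] = []
    for value in values:
        if value.attribute == first_attribute or not objects:
            objects.append({})
        objects[-1][value.attribute] = value.text
    return objects


# Hypothetical data-rich page with two records.
page = "Title: Dune Price: $9.99\nTitle: Emma Price: $4.50"
extractors = [
    regex_extractor("title", r"(?<=Title: )\w+"),
    regex_extractor("price", r"\$\d+\.\d{2}"),
]
objects = assemble_phase(extract_phase(page, extractors), first_attribute="title")
print(objects)
# [{'title': 'Dune', 'price': '$9.99'}, {'title': 'Emma', 'price': '$4.50'}]
```

Because the assembling step sees only what the extractors return, a missed or spurious value in the first phase propagates directly into the assembled objects, which is the dependency the abstract points out.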