A hierarchical approach to wrapper induction
Proceedings of the third annual conference on Autonomous Agents
Extracting semi-structured data through examples
Proceedings of the eighth international conference on Information and knowledge management
Data on the Web: from relations to semistructured data and XML
Data on the Web: from relations to semistructured data and XML
Conceptual-model-based data extraction from multiple-record Web pages
Data & Knowledge Engineering
Bootstrapping for example-based data extraction
Proceedings of the tenth international conference on Information and knowledge management
A brief survey of web data extraction tools
ACM SIGMOD Record
DEByE - Date extraction by example
Data & Knowledge Engineering
Hierarchical Wrapper Induction for Semistructured Information Sources
Autonomous Agents and Multi-Agent Systems
Visual Web Information Extraction with Lixto
Proceedings of the 27th International Conference on Very Large Data Bases
RoadRunner: Towards Automatic Data Extraction from Large Web Sites
Proceedings of the 27th International Conference on Very Large Data Bases
The Web-DL environment for building digital libraries from the Web
Proceedings of the 3rd ACM/IEEE-CS joint conference on Digital libraries
Structure-driven crawler generation by example
SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Hi-index | 0.00 |
To cope with the irregularities of typical semistructured Web data, extraction tools usually break the extraction task in two phases: an extraction phase, in which atomic attribute values are extracted from Web pages, and an assembling phase, in which these atomic values are grouped to form complex objects. As a consequence, the whole process is highly dependent on the attribute values collected in the first phase. All attribute values of interest should be properly recognized and spurious values should be discarded. Thus, attribute values extraction is an important problem. In this paper, we propose a new framework for generating attribute value extractors. The main appeal of this framework is that it can be adapted for dealing with specific types of data sources and to incorporate distinct types of heuristics for achieving good extraction performance. To demonstrate the feasibility of this proposal, we present an implementation of this framework for data-rich Web pages and show how a number of simple heuristics, some of them presented in the recent literature, can be incorporated into this framework. We also show experimental results and, in most cases, our results are at least as good as results previously presented in the literature.