Building Web Information Extraction Tasks

Authors:
Benjamin Habegger;Mohamed Quafafou
Affiliations:
Laboratoire d'Informatique de Nantes Atlantique, France;Institut des Applications Avancées de l'Internet, Ecole de l'Internet de Marseille, France
Venue:
WI '04 Proceedings of the 2004 IEEE/WIC/ACM International Conference on Web Intelligence
Year:
2004

Abstract

Most recent research on information extraction from the Web has concentrated on extracting the underlying content of a set of similarly structured web pages. However, this is not sufficient for building real-world web information extraction applications. Such applications require fully automated access to web sources, which involves more than extracting data from web pages: the necessary infrastructure must be set up to query a source, retrieve the result pages, extract the results from these pages, and filter out the unwanted results. In this paper we show how such an infrastructure can be set up. We propose to build a web information extraction application by decomposing it into sub-tasks and describing it in an XML-based language named WetDL. Each sub-task consists of applying a web-information-extraction-specific operation to its input, one such operation being the application of an extractor. By connecting such operations together, complex applications can be defined simply. We demonstrate this by applying the approach to real-world information extraction tasks, such as extracting DVD listings from Amazon.com and extracting addresses from online telephone directories such as superpages.com.
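
The decomposition described in the abstract (query a source, retrieve result pages, extract results, filter them) can be illustrated with a minimal Python sketch. This is not WetDL, whose XML syntax is not shown here, and it is not the authors' implementation. The operator names (`build_query`, `fetch`, `extract`, `keep`, `connect`), the `example.test` URL, and the in-memory `FAKE_WEB` pages are all assumptions made for illustration, and a regular expression stands in for a real extractor.

```python
import re
from functools import reduce
from typing import Callable, Dict, Iterable, Iterator

# An operator maps a stream of inputs to a stream of outputs.
Operator = Callable[[Iterable], Iterator]

# Hypothetical stand-in for a web source: query URL -> result page.
FAKE_WEB: Dict[str, str] = {
    "http://example.test/search?q=dvd": (
        "<li><b>Alpha</b> <i>12.50</i></li>"
        "<li><b>Beta</b> <i>30.00</i></li>"
    ),
}


def build_query(template: str) -> Operator:
    """Turn each keyword into a query URL."""
    def op(keywords):
        for kw in keywords:
            yield template.format(q=kw)
    return op


def fetch(source: Dict[str, str]) -> Operator:
    """Retrieve the result page for each URL (here from a dict, not the network)."""
    def op(urls):
        for url in urls:
            yield source[url]
    return op


def extract(pattern: str) -> Operator:
    """Apply an extractor (here a regex with named groups) to each page."""
    regex = re.compile(pattern)

    def op(pages):
        for page in pages:
            for match in regex.finditer(page):
                yield match.groupdict()
    return op


def keep(predicate: Callable[[dict], bool]) -> Operator:
    """Filter out unwanted results."""
    def op(records):
        return (r for r in records if predicate(r))
    return op


def connect(*ops: Operator) -> Operator:
    """Chain operators so each one consumes the previous one's output."""
    def pipeline(inputs):
        return reduce(lambda stream, op: op(stream), ops, inputs)
    return pipeline


# A task is just a connection of operations.
dvd_task = connect(
    build_query("http://example.test/search?q={q}"),
    fetch(FAKE_WEB),
    extract(r"<li><b>(?P<title>[^<]+)</b> <i>(?P<price>[\d.]+)</i></li>"),
    keep(lambda r: float(r["price"]) < 20.0),
)

results = list(dvd_task(["dvd"]))
print(results)  # [{'title': 'Alpha', 'price': '12.50'}]
```

Each operator only knows about its input stream, so swapping the regex extractor for another one, or adding a second filter, changes one argument to `connect` rather than the whole application.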