Generating finite-state transducers for semi-structured data extraction from the Web
Information Systems - Special issue on semistructured data
Wrapper induction: efficiency and expressiveness
Artificial Intelligence - Special issue on Intelligent internet systems
Hierarchical Wrapper Induction for Semistructured Information Sources
Autonomous Agents and Multi-Agent Systems
Automatic information extraction from semi-structured Web pages by pattern discovery
Decision Support Systems - Web retrieval and mining
RoadRunner: Towards Automatic Data Extraction from Large Web Sites
Proceedings of the 27th International Conference on Very Large Data Bases
A Petri net-based model for web service composition
ADC '03 Proceedings of the 14th Australasian database conference - Volume 17
Composing Web services on the Semantic Web
The VLDB Journal — The International Journal on Very Large Data Bases
WetDL: a web information extraction language
ADVIS'04 Proceedings of the Third international conference on Advances in Information Systems
Information Fusion
WetDL: a web information extraction language
ADVIS'04 Proceedings of the Third international conference on Advances in Information Systems
Hi-index | 0.00 |
Most recent research in the field of information extraction from the Web has concentrated on the task of extracting the underlying content of a set of similarly structured web pages. However in order to build real-world web information extraction applications this is not sufficient. Indeed, building such applications requires fully automating the access to web sources. This does not just involve the extraction of the data from web pages. There is a need to set up the necessary infrastructure allowing to query a source, retrieve the result pages, extract the results from these pages and filter out the unwanted results. In this paper we show how such an infrastructure can be set up. We propose to build a web information extraction application by decomposing it into sub-tasks and describing it in an XML based language named WetDL. Each of the sub-tasks consists in applying a web information extraction specific operation onto its input, one of these operators being the application of an extractor. By connecting such operations together it is possible to simply define complex applications. This is shown in the paper by applying this approach to real-world information extraction tasks such as extracting DVD listings from Amazon. com, extracting addresses from online telephone directories superpages.com, etc.