A two-phase rule generation and optimization approach for wrapper generation

Authors:
Yanan Hao;Yanchun Zhang
Affiliations:
School of Computer Science and Mathematics, Victoria University, Melbourne, VIC, Australia;School of Computer Science and Mathematics, Victoria University, Melbourne, VIC, Australia
Venue:
ADC '06 Proceedings of the 17th Australasian Database Conference - Volume 49
Year:
2006

Abstract

Web information extraction is a fundamental issue for web information management and integration. A common approach is to use wrappers to extract data from web pages or documents. A critical issue in wrapper development, however, is how to generate the extraction rules. In this paper, we propose a novel two-phase rule generation and optimization (2P-RULE) approach for wrapper generation. 2P-RULE consists of an internal rule optimization (IRO) process and an external rule optimization (ERO) process. In IRO, the user first creates, through a GUI, a mapping from the useful values in a web page to a schema that the user specifies according to the target web information. Based on this mapping, the system automatically generates a rule list for the schema. In ERO, the user can create multiple mappings to generate further rule lists. All the acquired rule lists are then merged and refined into one optimized rule list, which is expressed in XQuery as the final set of extraction rules. Experiments show that the 2P-RULE approach is suitable for extracting information from web pages with complex nested structures, and that it also achieves better precision and recall.
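
The merging step described in the abstract can be pictured with a small sketch. It is illustrative only and is not the authors' algorithm: the rule representation (a path of tag and position steps), the merge heuristic (drop a position wherever two rules disagree on it), and all names (`merge_rules`, `merge_rule_lists`, `to_path`, `to_xquery`, `page.xml`) are assumptions made for this example. It also assumes the page has already been converted to well-formed XML.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical rule model: a path of (tag, position) steps.
# A position of None means "any sibling with this tag".
Step = Tuple[str, Optional[int]]
Rule = List[Step]
RuleList = Dict[str, Rule]  # schema field -> rule


def merge_rules(a: Rule, b: Rule) -> Rule:
    """Generalize two rules for one field: keep steps that agree,
    and drop the position where only the position differs."""
    if len(a) != len(b) or any(x[0] != y[0] for x, y in zip(a, b)):
        raise ValueError("rules have different tag paths; cannot merge")
    return [(ta, pa if pa == pb else None) for (ta, pa), (_, pb) in zip(a, b)]


def merge_rule_lists(rule_lists: List[RuleList]) -> RuleList:
    """Fold several rule lists (one per user mapping) into one."""
    merged: RuleList = {}
    for rules in rule_lists:
        for field, rule in rules.items():
            merged[field] = merge_rules(merged[field], rule) if field in merged else rule
    return merged


def to_path(rule: Rule) -> str:
    """Render a rule as an XPath-style path, usable inside XQuery."""
    return "".join(f"/{tag}" if pos is None else f"/{tag}[{pos}]" for tag, pos in rule)


def to_xquery(rules: RuleList, doc: str = "page.xml") -> str:
    """Emit one XQuery expression that builds a record from the merged rules."""
    lines = [f'for $d in doc("{doc}")', "return <record>"]
    for field, rule in rules.items():
        lines.append(f"  <{field}>{{ $d{to_path(rule)}/text() }}</{field}>")
    lines.append("</record>")
    return "\n".join(lines)


# Two mappings that picked the same field from different table rows.
first = {"title": [("html", 1), ("body", 1), ("table", 1), ("tr", 2), ("td", 1)]}
second = {"title": [("html", 1), ("body", 1), ("table", 1), ("tr", 3), ("td", 1)]}
merged = merge_rule_lists([first, second])
print(to_path(merged["title"]))  # /html[1]/body[1]/table[1]/tr/td[1]
print(to_xquery(merged))
```

In this sketch the two example mappings differ only in the row index, so the merged rule matches every row of the table; a real refinement step would presumably also handle rules whose tag paths differ, which this example simply rejects.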