Learning Information Extraction Rules for Semi-Structured and Free Text
Machine Learning - Special issue on natural language learning
A brief survey of web data extraction tools
ACM SIGMOD Record
ACM SIGMOD Record
Building Light-Weight Wrappers for Legacy Web Data-Sources Using W4F
VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases
Visual Web Information Extraction with Lixto
Proceedings of the 27th International Conference on Very Large Data Bases
XWRAP: An XML-Enabled Wrapper Construction System for Web Information Sources
ICDE '00 Proceedings of the 16th International Conference on Data Engineering
Extracting structured data from Web pages
Proceedings of the 2003 ACM SIGMOD international conference on Management of data
A Supervised Visual Wrapper Generator for Web-Data Extraction
COMPSAC '03 Proceedings of the 27th Annual International Conference on Computer Software and Applications
Automatic data extraction from data-rich web pages
DASFAA'05 Proceedings of the 10th international conference on Database Systems for Advanced Applications
Hi-index | 0.00 |
Web information extraction is a fundamental issue for web information management and integrations. A common approach is to use wrappers to extract data from web pages or documents. However, a critical issue for wrapper development is how to generate extraction rules. In this paper, we propose a novel two-phase rule generation and optimization (2P-RULE) approach for wrapper generation. 2P-RULE consists of internal rule optimization (IRO) process and external rule optimization (ERO) process. In IRO, a user, through a GUI interface, firstly creates a mapping from useful values in web page to a schema specified by the users according to target web information. Based on the mapping, the system automatically generates a rule list for the schema. Whereas in ERO, the user can create multiple mappings to generate further rule lists. All the acquired rule lists are merged and refined into one optimized rule list, which is expressed with XQuery as the final extraction rules. Experiments show that our 2P-RULE approach is suitable for extracting information from web pages with complex nested structure, and can also achieve better precision and recall ratio.