A two-phase rule generation and optimization approach for wrapper generation

  • Authors:
  • Yanan Hao;Yanchun Zhang

  • Affiliations:
  • School of Computer Science and Mathematics, Victoria University, Melbourne, VIC, Australia;School of Computer Science and Mathematics, Victoria University, Melbourne, VIC, Australia

  • Venue:
  • ADC '06 Proceedings of the 17th Australasian Database Conference - Volume 49
  • Year:
  • 2006

Abstract

Web information extraction is a fundamental problem in web information management and integration. A common approach is to use wrappers to extract data from web pages or documents. A critical issue in wrapper development, however, is how to generate the extraction rules. In this paper, we propose a novel two-phase rule generation and optimization (2P-RULE) approach for wrapper generation. 2P-RULE consists of an internal rule optimization (IRO) process and an external rule optimization (ERO) process. In IRO, a user first creates, through a GUI, a mapping from useful values in a web page to a user-specified schema that describes the target web information. Based on this mapping, the system automatically generates a rule list for the schema. In ERO, the user can create multiple mappings to generate further rule lists. All the acquired rule lists are then merged and refined into one optimized rule list, which is expressed in XQuery as the final extraction rules. Experiments show that the 2P-RULE approach is suitable for extracting information from web pages with complex nested structure and can also achieve better precision and recall.
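
The two phases described in the abstract can be pictured with a small toy sketch. The abstract does not specify the actual 2P-RULE algorithms, so everything below is an assumption made for illustration: a mapping is modeled as a dictionary from schema field to a path of (tag, position) steps, `generate_rules` stands in for per-mapping rule generation, `merge_rules` stands in for merging rule lists from several mappings, and the output is a simple path expression of the kind XQuery can evaluate. All function names and the generalization strategy are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical data model (not from the paper):
# a step is (tag name, 1-based position among same-tag siblings);
# position 0 is used here to mean "any position".
Step = Tuple[str, int]
RuleList = Dict[str, List[Step]]  # schema field -> path rule


def generate_rules(mapping: Dict[str, List[Step]]) -> RuleList:
    """Toy stand-in for phase 1: one exact path rule per schema field."""
    return {field: list(path) for field, path in mapping.items()}


def merge_rules(rule_lists: List[RuleList]) -> RuleList:
    """Toy stand-in for phase 2: merge rule lists from several mappings.

    Steps whose positions agree across all examples are kept;
    steps whose positions differ are generalized to "any position".
    """
    merged: RuleList = {}
    for rules in rule_lists:
        for field, path in rules.items():
            if field not in merged:
                merged[field] = list(path)
                continue
            old = merged[field]
            same_shape = len(old) == len(path) and all(
                a[0] == b[0] for a, b in zip(old, path)
            )
            if not same_shape:
                raise ValueError(f"cannot merge differing paths for {field!r}")
            merged[field] = [
                (a[0], a[1] if a[1] == b[1] else 0) for a, b in zip(old, path)
            ]
    return merged


def to_path_expression(path: List[Step]) -> str:
    """Render a rule as a path expression usable inside an XQuery query."""
    return "/" + "/".join(
        tag if pos == 0 else f"{tag}[{pos}]" for tag, pos in path
    )


if __name__ == "__main__":
    # Two user-created mappings that point at the same field in different rows.
    first = {"title": [("html", 1), ("body", 1), ("table", 1), ("tr", 1), ("td", 1)]}
    second = {"title": [("html", 1), ("body", 1), ("table", 1), ("tr", 2), ("td", 1)]}
    merged = merge_rules([generate_rules(first), generate_rules(second)])
    print(to_path_expression(merged["title"]))
```

In this sketch, two mappings that differ only in the row index yield the generalized rule `/html[1]/body[1]/table[1]/tr/td[1]`, which matches the first cell of every row. A real system would need a richer rule language and a refinement step, which the abstract mentions but does not detail.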