Schema-guided wrapper maintenance for web-data extraction

  • Authors: Xiaofeng Meng; Dongdong Hu; Chen Li

  • Affiliations: Renmin University of China, Beijing, China; Renmin University of China, Beijing, China; University of California, Irvine, CA

  • Venue: WIDM '03 Proceedings of the 5th ACM international workshop on Web information and data management
  • Year: 2003

Abstract

Extracting data from Web pages using wrappers is a fundamental problem arising in a wide variety of applications of great practical interest. There are two main issues in Web-data extraction: wrapper generation and wrapper maintenance. In this paper, we propose a novel schema-guided approach to the problem of automatic wrapper maintenance. It is based on the observation that, despite various page changes, many important features of the pages are preserved, such as the syntactic patterns, annotations, and hyperlinks of the extracted data items. Our approach uses these preserved features to identify the locations of the desired values in the changed pages, and repairs the wrappers accordingly by inducing semantic blocks from the HTML tree. Our extensive experiments on real Web sites show that the proposed approach can effectively maintain wrappers to extract the desired data with high accuracy.
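
The idea of relocating values through preserved features can be illustrated with a toy sketch. This is not the authors' algorithm: it covers only two of the features named in the abstract (annotations and syntactic patterns), ignores hyperlinks and the induction of semantic blocks, and every name in it (`TextCollector`, `relocate`, `FEATURES`, `NEW_PAGE`) is hypothetical. It assumes each field was recorded, while the old wrapper still worked, as a label that precedes the value plus a regular expression the value matches.

```python
import re
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect the non-empty text nodes of a page in document order."""

    def __init__(self):
        super().__init__()
        self.texts = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.texts.append(text)


def relocate(html, features):
    """Map each field to the first text node that matches its syntactic
    pattern and directly follows its annotation, or None if none is found."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    texts = collector.texts
    found = {}
    for field, (annotation, pattern) in features.items():
        found[field] = None
        # Walk adjacent text nodes: (candidate label, candidate value).
        for prev, cur in zip(texts, texts[1:]):
            if prev.startswith(annotation) and re.fullmatch(pattern, cur):
                found[field] = cur
                break
    return found


# Hypothetical features: (annotation, syntactic pattern) per schema field.
FEATURES = {
    "price": ("Price:", r"\$\d+\.\d{2}"),
    "isbn": ("ISBN:", r"\d{10}"),
}

# Hypothetical changed page: the layout moved from a table to paragraphs,
# but the labels and value formats survived the redesign.
NEW_PAGE = (
    "<div><p><b>Price:</b> <span>$19.99</span></p>"
    "<p><b>ISBN:</b> <i>0987654321</i></p></div>"
)

print(relocate(NEW_PAGE, FEATURES))
```

Because the lookup depends only on the label text and the value format, it is unaffected by the change of tags around them. A fuller treatment along the lines the abstract describes would still need to map the located values back to positions in the HTML tree and group them into blocks before the wrapper could be repaired.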