Extracting XML data from the web

  • Authors:
  • Ngo Sy Viet Phu; Toshiyuki Amagasa; Hiroyuki Kitagawa

  • Affiliations:
  • University of Tsukuba, Japan; University of Tsukuba, Japan; University of Tsukuba, Japan

  • Venue:
  • Proceedings of the 12th International Conference on Information Integration and Web-based Applications & Services
  • Year:
  • 2010

Abstract

Information Extraction (IE) is a technique for extracting structured information (records) from unstructured documents such as Web pages. However, existing techniques mainly target simple records, such as binary relationships like "(company, location)" or named entities like "(organization)". In this paper, we propose an algorithm for extracting complex records such as XML data by utilizing an existing IE technique. Given a set of seed records in the form of XML data (XML records), we first infer the schema information from the XML records. Then, we transform the XML records into a set of relational records spanning several tables. The obtained relational tables are decomposed into a set of binary relations, which are forwarded to a record extraction system. We reconstruct XML data from the results returned by the record extraction system. We point out that a naive implementation does not work well, and propose an improved scheme for more efficient XML record extraction. We evaluate the effectiveness of the proposed algorithm experimentally.
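
The first three steps described in the abstract (schema inference, transformation to relational tables, decomposition into binary relations) can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the seed data, the function names (`infer_schema`, `to_tables`, `to_binary_relations`), and the choice to pair every two leaf fields of a row are all assumptions made for illustration only, and the sketch does not cover the record extraction system or the XML reconstruction step.

```python
import xml.etree.ElementTree as ET
from itertools import combinations

# Hypothetical seed XML records (invented for this illustration).
SEED = """
<companies>
  <company><name>Acme</name><location>Tokyo</location><ceo>Sato</ceo></company>
  <company><name>Globex</name><location>Osaka</location><ceo>Tanaka</ceo></company>
</companies>
"""

def infer_schema(root):
    """Collect the (parent_tag, child_tag) edges observed in the seed data."""
    edges = set()
    for parent in root.iter():
        for child in parent:
            edges.add((parent.tag, child.tag))
    return edges

def to_tables(root):
    """Flatten the XML into tables: one table per element tag that has
    leaf children, one row (dict of leaf tag -> text) per such element."""
    tables = {}
    for elem in root.iter():
        leaves = {c.tag: (c.text or "").strip() for c in elem if len(c) == 0}
        if leaves:
            tables.setdefault(elem.tag, []).append(leaves)
    return tables

def to_binary_relations(tables):
    """Decompose each row into binary relations, here (as an assumption)
    by pairing every two columns of the same row."""
    relations = {}
    for rows in tables.values():
        for row in rows:
            for a, b in combinations(sorted(row), 2):
                relations.setdefault((a, b), []).append((row[a], row[b]))
    return relations

root = ET.fromstring(SEED.strip())
schema = infer_schema(root)
tables = to_tables(root)
binary = to_binary_relations(tables)
```

Each entry of `binary`, such as the pairs stored under `("location", "name")`, has the shape of the simple binary records that existing IE systems already handle, so it could in principle serve as seed input to such a system.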