Automatic generation of wrapper for data extraction from the web

  • Authors:
  • Suzhi Zhang; Zhengding Lu

  • Affiliations:
  • College of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei, China and Department of Computer Science and Technology, ZhengZhou Institute of Light Indus ...; College of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei, China

  • Venue:
  • ICWE'03 Proceedings of the 2003 international conference on Web engineering
  • Year:
  • 2003

Abstract

With the development of the Internet, the Web has become an invaluable information source. To use this information for more than human browsing, web pages in HTML must be converted into a format meaningful to software programs. Wrappers are a useful technique for converting HTML documents into semantically meaningful XML files. In this paper, we propose a data extraction approach based on an extraction schema: it automatically generates a wrapper that extracts data from an HTML document and produces an XML document conforming to a given DTD. After the user defines the extraction data schema in the form of a DTD, the wrapper is generated automatically by an induction and learning algorithm. The experiment indicates that the approach can correctly extract the required data from the source document with high accuracy.
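
To make the wrapper idea concrete, here is a minimal sketch of what a wrapper does once it exists: it reads an HTML page and emits XML shaped by a target schema. It is written by hand and is not the induction and learning algorithm from the paper, which the abstract does not detail. The sample page, the `books`/`book`/`title`/`price` schema, and the names `RowCollector` and `wrap` are all assumptions made up for illustration.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical input page; not taken from the paper.
HTML = """
<html><body><table>
<tr><td>Web Data Mining</td><td>29.50</td></tr>
<tr><td>XML Basics</td><td>15.00</td></tr>
</table></body></html>
"""

# Hypothetical target schema, shown as a DTD for reference only:
#   <!ELEMENT books (book*)>
#   <!ELEMENT book (title, price)>
#   <!ELEMENT title (#PCDATA)>
#   <!ELEMENT price (#PCDATA)>
FIELDS = ["title", "price"]  # element names, in table-column order


class RowCollector(HTMLParser):
    """Collects the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows, each a list of cell strings
        self._row = None        # cells of the row currently being read
        self._in_cell = False   # True while inside a <td> element

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td" and self._in_cell:
            self._row[-1] = self._row[-1].strip()
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data


def wrap(html):
    """Hand-written wrapper: HTML table rows in, schema-shaped XML out."""
    parser = RowCollector()
    parser.feed(html)
    parser.close()
    root = ET.Element("books")
    for row in parser.rows:
        if len(row) != len(FIELDS):
            continue  # skip rows that do not match the assumed schema
        book = ET.SubElement(root, "book")
        for name, value in zip(FIELDS, row):
            ET.SubElement(book, name).text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(wrap(HTML))
```

In this sketch the mapping from table columns to schema elements is hard-coded in `FIELDS`. The approach described in the abstract would instead derive that mapping automatically from the user-supplied DTD.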