Wrapping web data into XML

  • Authors:
  • Wei Han; David Buttler; Calton Pu

  • Affiliations:
  • Georgia Institute of Technology, Atlanta, Georgia; Georgia Institute of Technology, Atlanta, Georgia; Georgia Institute of Technology, Atlanta, Georgia

  • Venue:
  • ACM SIGMOD Record
  • Year:
  • 2001

Abstract

The vast majority of online information is part of the World Wide Web. In order to use this information for more than human browsing, web pages in HTML must be converted into a format meaningful to software programs. Wrappers have been a useful technique for converting HTML documents into semantically meaningful XML files. However, developing wrappers is slow and labor-intensive. Further, frequent changes to the HTML documents typically require frequent changes to the wrappers. This paper describes XWRAP Elite, a tool to automatically generate robust wrappers. XWRAP breaks down the conversion process into three steps. First, discover where the data is located in an HTML page and separate the data into individual objects. Second, decompose objects into data elements. Third, mark objects and elements in an output format. XWRAP Elite automates the first two steps and minimizes human involvement in marking output data. Our experience shows that XWRAP is able to create useful wrapper software for a wide variety of real-world HTML documents.
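
The three steps named in the abstract can be illustrated with a small hand-written wrapper. This is only a sketch under stated assumptions and not XWRAP Elite's implementation: the abstract does not describe its discovery heuristics, so the example simply assumes the data sits in an HTML table, treats each <tr> as one object and each <td> as one data element, and takes the output tag names from the caller. The names RowCollector, wrap, and SAMPLE are hypothetical and exist only for this illustration.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class RowCollector(HTMLParser):
    """Steps 1 and 2 (assumed rule): each <tr> is an object, each <td> an element."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished objects, each a list of cell strings
        self._row = None    # cells of the <tr> currently open, if any
        self._cell = None   # text fragments of the <td> currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Collect text only while inside a data cell; nested tags such as <b> are ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # skip rows without <td> cells, e.g. header rows
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def wrap(html, object_name, field_names):
    """Step 3: mark objects and elements with caller-supplied XML names."""
    collector = RowCollector()
    collector.feed(html)
    collector.close()
    root = ET.Element(object_name + "s")
    for row in collector.rows:
        obj = ET.SubElement(root, object_name)
        for name, value in zip(field_names, row):
            ET.SubElement(obj, name).text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical input page, invented for this example.
SAMPLE = """
<html><body><h1>Books</h1>
<table>
<tr><th>Title</th><th>Price</th></tr>
<tr><td>XML Basics</td><td>$20</td></tr>
<tr><td>Web <b>Data</b></td><td>$35</td></tr>
</table></body></html>
"""

if __name__ == "__main__":
    print(wrap(SAMPLE, "book", ["title", "price"]))
```

A wrapper written this way also shows the brittleness the abstract mentions: if the page switches from a table to, say, a list of <div> blocks, the hard-coded <tr>/<td> rule stops matching and the wrapper has to be rewritten by hand, which is the maintenance cost an automatic wrapper generator aims to reduce.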