Wrapper generation for semi-structured Internet sources

  • Authors:
  • Naveen Ashish; Craig A. Knoblock

  • Affiliations:
  • Information Sciences Institute and Department of Computer Science, University of Southern California, 4676 Admiralty Way, Marina del Rey, CA; Information Sciences Institute and Department of Computer Science, University of Southern California, 4676 Admiralty Way, Marina del Rey, CA

  • Venue:
  • ACM SIGMOD Record
  • Year:
  • 1997

Abstract

With the current explosion of information on the World Wide Web (WWW), a wealth of information on many different subjects has become available on-line. Numerous sources contain information that can be classified as semi-structured. At present, however, the only way to access this information is by browsing individual pages; web documents cannot be queried in a database-like fashion based on their underlying structure. We can nevertheless provide database-like querying for semi-structured WWW sources by building wrappers around them. We present an approach for semi-automatically generating such wrappers. The key idea is to exploit the formatting information in pages from a source to hypothesize the underlying structure of each page. From this structure the system generates a wrapper that facilitates querying the source and possibly integrating it with other sources. We demonstrate the ease with which wrappers can be built for a number of Internet sources in different domains using our implemented wrapper generation toolkit.
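
As a rough illustration of the key idea in the abstract, the sketch below guesses a page's structure from its formatting and then reuses that guess to extract fields from other pages of the same source. It is not the authors' toolkit or algorithm: the choice of heading and bold tags as structural cues, and the names HEADING_TAGS, StructureGuesser, and build_wrapper, are assumptions made only for this example.

```python
from html.parser import HTMLParser

# Assumption: text inside these tags is treated as a section heading.
HEADING_TAGS = {"h1", "h2", "h3", "h4", "b", "strong"}


class StructureGuesser(HTMLParser):
    """Hypothesizes page structure: heading text -> text that follows it."""

    def __init__(self):
        super().__init__()
        self.sections = {}          # heading name -> list of text chunks
        self._current = None        # heading whose body is being collected
        self._in_heading = False
        self._heading_parts = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._in_heading = True
            self._heading_parts = []

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self._in_heading:
            self._in_heading = False
            # Normalize whitespace and drop a trailing colon.
            name = " ".join("".join(self._heading_parts).split()).rstrip(":")
            if name:
                self._current = name
                self.sections.setdefault(name, [])

    def handle_data(self, data):
        if self._in_heading:
            self._heading_parts.append(data)
        elif self._current is not None:
            text = " ".join(data.split())
            if text:
                self.sections[self._current].append(text)


def build_wrapper(sample_page):
    """Learn field names from one sample page; return them with an extractor."""
    guesser = StructureGuesser()
    guesser.feed(sample_page)
    guesser.close()
    fields = list(guesser.sections)  # hypothesized structure, in page order

    def wrapper(page):
        # Apply the same formatting cues to another page of the source and
        # return one record keyed by the learned field names.
        g = StructureGuesser()
        g.feed(page)
        g.close()
        return {f: " ".join(g.sections.get(f, [])) for f in fields}

    return fields, wrapper
```

Calling build_wrapper on one sample page returns the hypothesized field names and a function that maps any similarly formatted page to a record. A semi-automatic system would let a user correct the guessed fields before the wrapper is put to use; that review step is omitted here.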