Hierarchical Wrapper Induction for Semistructured Information Sources

  • Authors:
  • Ion Muslea; Steven Minton; Craig A. Knoblock

  • Affiliations:
  • Information Sciences Institute and Integrated Media Systems Center, University of Southern California, 4676 Admiralty Way, Marina del Rey, CA 90292-6695 (all authors); muslea@isi.edu; minton@isi.edu; knoblock@isi.edu

  • Venue:
  • Autonomous Agents and Multi-Agent Systems
  • Year:
  • 2001

Abstract

With the tremendous amount of information that becomes available on the Web on a daily basis, the ability to quickly develop information agents has become a crucial problem. A vital component of any Web-based information agent is a set of wrappers that can extract the relevant data from semistructured information sources. Our novel approach to wrapper induction is based on the idea of hierarchical information extraction, which turns the hard problem of extracting data from an arbitrarily complex document into a series of simpler extraction tasks. We introduce an inductive algorithm, STALKER, that generates high-accuracy extraction rules based on user-labeled training examples. Labeling the training data represents the major bottleneck in using wrapper induction techniques, and our experimental results show that STALKER requires up to two orders of magnitude fewer examples than other algorithms. Furthermore, STALKER can wrap information sources that could not be wrapped by existing inductive techniques.
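
The idea of hierarchical extraction described in the abstract can be illustrated with a small Python sketch. It is not the STALKER algorithm or its rule language: the rules here are written by hand, whereas the paper's contribution is inducing such rules from labeled examples. The sample page, the landmark strings, and the helper names (`skip_to`, `extract`, `extract_all`, `wrap`) are all assumptions made for illustration. The sketch only shows how one hard task (pull every name and phone number out of a page) can be split into simpler ones: find the list, split it into items, then extract each field from a single item.

```python
def skip_to(text, landmark, start=0):
    """Return the index just past the first `landmark` at or after `start`, or -1."""
    i = text.find(landmark, start)
    return -1 if i < 0 else i + len(landmark)


def extract(text, start_landmarks, end_landmark):
    """Consume the start landmarks in order, then return the text up to `end_landmark`."""
    pos = 0
    for landmark in start_landmarks:
        pos = skip_to(text, landmark, pos)
        if pos < 0:
            return None  # a start landmark is missing: the rule does not match
    end = text.find(end_landmark, pos)
    return None if end < 0 else text[pos:end]


def extract_all(text, start_landmark, end_landmark):
    """Return every substring enclosed by the two landmarks, scanning left to right."""
    items, pos = [], 0
    while True:
        begin = skip_to(text, start_landmark, pos)
        if begin < 0:
            break
        end = text.find(end_landmark, begin)
        if end < 0:
            break
        items.append(text[begin:end])
        pos = end + len(end_landmark)
    return items


# Hypothetical semistructured page, invented for this example.
PAGE = (
    "<h1>Restaurants</h1><ul>"
    "<li><b>Cafe Uno</b> Phone: <i>555-0101</i></li>"
    "<li><b>Deli Due</b> Phone: <i>555-0102</i></li>"
    "</ul>"
)


def wrap(page):
    """Hierarchical extraction: page -> list -> items -> fields."""
    listing = extract(page, ["<ul>"], "</ul>")  # level 1: isolate the list
    if listing is None:
        return []
    records = []
    for item in extract_all(listing, "<li>", "</li>"):  # level 2: split into items
        records.append({  # level 3: each field is extracted from one item only
            "name": extract(item, ["<b>"], "</b>"),
            "phone": extract(item, ["Phone:", "<i>"], "</i>"),
        })
    return records


if __name__ == "__main__":
    for record in wrap(PAGE):
        print(record)
```

Because each field rule only ever sees a single item, it can stay short even when the surrounding page is complex, which is the intuition behind decomposing the extraction task.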