Accurately and reliably extracting data from the Web: a machine learning approach

  • Authors:
  • Craig A. Knoblock;Kristina Lerman;Steven Minton;Ion Muslea

  • Affiliations:
  • University of Southern California, 4676 Admiralty Way, Marina del Rey, CA and Fetch Technologies, 4676 Admiralty Way, Marina del Rey, CA;University of Southern California, 4676 Admiralty Way, Marina del Rey, CA;Fetch Technologies, 4676 Admiralty Way, Marina del Rey, CA;University of Southern California, 4676 Admiralty Way, Marina del Rey, CA

  • Venue:
  • Intelligent exploration of the web
  • Year:
  • 2003

Abstract

A critical problem in developing information agents for the Web is accessing data that is formatted for human use. We have developed a set of tools for extracting data from web sites and transforming it into a structured data format, such as XML. The resulting data can then be used to build new applications without having to deal with unstructured data. The advantages of our wrapping technology over previous work are the ability to learn highly accurate extraction rules, to verify the wrapper to ensure that the correct data continues to be extracted, and to automatically adapt to changes in the sites from which the data is being extracted.
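
To make the idea of a wrapper concrete, the following is a minimal sketch in Python. It is not the authors' system: the rules here are written by hand rather than learned, and the names RULES, PATTERNS, extract, to_xml, and verify, as well as the sample page, are all hypothetical. It assumes each field can be located by the text immediately before and after it, emits the result as XML, and runs a simple pattern check to notice when a site change breaks extraction.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical hand-written rules (the paper learns such rules instead):
# each field is located by a start landmark and an end landmark.
RULES = {
    "name": ("<b>", "</b>"),
    "phone": ("Phone: ", "<br>"),
}

# Hypothetical expected shape of each field, used only for verification.
PATTERNS = {
    "name": r"[A-Za-z' ]+",
    "phone": r"\(\d{3}\) \d{3}-\d{4}",
}


def extract(page, rules):
    """Return {field: text between its landmarks}, or None if not found."""
    record = {}
    for field, (start, end) in rules.items():
        i = page.find(start)
        if i == -1:
            record[field] = None
            continue
        i += len(start)
        j = page.find(end, i)
        record[field] = page[i:j].strip() if j != -1 else None
    return record


def to_xml(record, tag="record"):
    """Serialize an extracted record as a flat XML element."""
    root = ET.Element(tag)
    for field, value in record.items():
        ET.SubElement(root, field).text = value or ""
    return ET.tostring(root, encoding="unicode")


def verify(record, patterns):
    """True only if every field was found and matches its expected pattern."""
    return all(
        record.get(field) is not None and re.fullmatch(pattern, record[field])
        for field, pattern in patterns.items()
    )


page = "<p><b>Joe's Diner</b><br>Phone: (310) 555-0100<br></p>"
record = extract(page, RULES)
print(to_xml(record))
print(verify(record, PATTERNS))

# Simulated site change: the name is now wrapped in <strong>, so the
# "name" rule no longer fires and verification fails.
changed = page.replace("<b>", "<strong>").replace("</b>", "</strong>")
print(verify(extract(changed, RULES), PATTERNS))
```

In this sketch a failed verification only reports that the wrapper no longer matches the page. The automatic adaptation described in the abstract would go further and repair the rules, which is not attempted here.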