Automatic wrappers for large scale web extraction

  • Authors: Nilesh Dalvi; Ravi Kumar; Mohamed Soliman

  • Affiliations: Yahoo! Research, Santa Clara, CA; Yahoo! Research, Santa Clara, CA; U. of Waterloo, Ontario, Canada

  • Venue: Proceedings of the VLDB Endowment
  • Year: 2011

Abstract

We present a generic framework that makes wrapper induction algorithms tolerant of noise in the training data. This enables us to learn wrappers in a completely unsupervised manner from noisy training data that is obtained automatically and cheaply, e.g., using dictionaries and regular expressions. By removing the site-level supervision that wrapper-based techniques require, we are able to perform information extraction at web scale, with accuracy unattained by existing unsupervised extraction techniques. Our system is used in production at Yahoo! and powers live applications.
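
To make the idea in the abstract concrete, here is a minimal sketch of learning a wrapper from noisy, automatically generated labels. It is an illustration under stated assumptions, not the authors' system: the toy pages, the phone-number regular expression, the use of a tag path as the "wrapper", and the F1-style score are all hypothetical choices made for this example. The abstract does not describe how the paper ranks candidate wrappers, so the scoring here is a simple stand-in.

```python
import re
from collections import defaultdict

# Hypothetical toy site: each page is a list of (tag_path, text) nodes.
PAGES = [
    [("html/body/h1", "Blue Bottle Cafe"),
     ("html/body/div/span", "(408) 555-0100"),
     ("html/body/p", "Call (408) 555-0199 for catering")],  # annotator false positive
    [("html/body/h1", "Taqueria Sol"),
     ("html/body/div/span", "408-555-0123"),
     ("html/body/p", "Open daily")],
    [("html/body/h1", "Noodle Bar"),
     ("html/body/div/span", "555.0142 ext 2"),               # annotator miss
     ("html/body/p", "Cash only")],
]

# Assumed cheap annotator: a regular expression for one phone format.
PHONE_RE = re.compile(r"\(?\d{3}\)?[ -]\d{3}-\d{4}")


def annotate(pages):
    """Return noisy labels as a set of (page_index, node_index) pairs."""
    labels = set()
    for p, page in enumerate(pages):
        for n, (_, text) in enumerate(page):
            if PHONE_RE.search(text):
                labels.add((p, n))
    return labels


def nodes_by_path(pages):
    """Group node positions by tag path; each path is one candidate wrapper."""
    by_path = defaultdict(set)
    for p, page in enumerate(pages):
        for n, (path, _) in enumerate(page):
            by_path[path].add((p, n))
    return by_path


def f1(extracted, labels):
    """F1 agreement between a wrapper's output and the noisy labels."""
    tp = len(extracted & labels)
    if tp == 0:
        return 0.0
    precision = tp / len(extracted)
    recall = tp / len(labels)
    return 2 * precision * recall / (precision + recall)


def induce_wrapper(pages):
    """Pick the tag path that best agrees with the noisy labels overall."""
    labels = annotate(pages)
    by_path = nodes_by_path(pages)
    return max(by_path, key=lambda path: f1(by_path[path], labels))


def extract(pages, wrapper_path):
    """Apply the chosen wrapper: return the text of every matching node."""
    return [text for page in pages for (path, text) in page if path == wrapper_path]


best = induce_wrapper(PAGES)
values = extract(PAGES, best)
```

In this toy run the annotator labels one wrong node and misses one correct node, yet the selected path still agrees with most labels, so the wrapper also recovers the value the annotator missed. Choosing the rule that best explains the labels overall, rather than one that must match every label exactly, is what tolerating noise means in this sketch.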