A flexible learning system for wrapping tables and lists in HTML documents

  • Authors: William W. Cohen; Matthew Hurst; Lee S. Jensen

  • Affiliations: WhizBang Labs, Pittsburgh, PA (all three authors)

  • Venue: Proceedings of the 11th International Conference on World Wide Web
  • Year: 2002

Abstract

A program that makes an existing website look like a database is called a wrapper. Wrapper learning is the problem of learning website wrappers from examples. We present a wrapper-learning system called WL2 that can exploit several different representations of a document. These representations include DOM-level and token-level representations, as well as two-dimensional geometric views of the rendered page (for tabular data) and representations of the visual appearance of text as it will be rendered. Additionally, the learning system is modular and can be easily adapted to new domains and tasks. The learning system described is part of an "industrial-strength" wrapper management system that is in active use at WhizBang Labs. Controlled experiments show that the learner has broader coverage and a faster learning rate than earlier wrapper-learning systems.
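
To make the term "wrapper" concrete, the following is a minimal sketch of a hand-written wrapper that exposes an HTML table as a list of records, much as a database query would return rows. It is not WL2 and does not learn anything: WL2 learns such extraction rules from examples, whereas here the rule (first row is the header, remaining rows are records) is hard-coded. The names `TableWrapper`, `wrap`, and `PAGE` and the sample page are hypothetical and introduced only for illustration.

```python
from html.parser import HTMLParser


class TableWrapper(HTMLParser):
    """Hand-written (not learned) wrapper: collects the rows of an HTML table."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None outside a row
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only text inside a cell is kept; whitespace between tags is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def wrap(html):
    """Return the table as a list of dicts keyed by the header row."""
    parser = TableWrapper()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Hypothetical sample page, assumed to contain a single simple table.
PAGE = """
<table>
  <tr><th>Title</th><th>Year</th></tr>
  <tr><td>Paper A</td><td>2001</td></tr>
  <tr><td>Paper B</td><td>2002</td></tr>
</table>
"""

records = wrap(PAGE)
for record in records:
    print(record)
```

A wrapper like this breaks as soon as the page layout changes, which is the maintenance burden that motivates learning wrappers from examples instead of writing them by hand.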