Adaptive record extraction from web pages

  • Authors:
  • Justin Park; Denilson Barbosa

  • Affiliations:
  • University of Calgary, Calgary, AB, Canada; University of Calgary, Calgary, AB, Canada

  • Venue:
  • Proceedings of the 16th international conference on World Wide Web
  • Year:
  • 2007

Abstract

We describe an adaptive method for extracting records from web pages. Our algorithm combines a weighted tree matching metric with clustering for obtaining data extraction patterns. We compare our method experimentally to the state-of-the-art, and show that our approach is very competitive for rigidly-structured records (such as product descriptions) and far superior for loosely-structured records (such as entries on blogs).
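
The abstract does not spell out the metric or the clustering procedure, so the following is only an illustrative sketch of the general idea, not the authors' algorithm. It assumes a depth-weighted variant of the classic Simple Tree Matching recurrence, in which a matched node pair at depth d contributes decay**d, and a greedy threshold clustering of candidate subtrees. The names `Node`, `weighted_match`, `similarity`, `cluster`, and the parameters `decay` and `threshold` are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A minimal stand-in for a DOM element: a tag name plus ordered children."""
    tag: str
    children: list = field(default_factory=list)


def weighted_match(a, b, depth=0, decay=0.5):
    """Depth-weighted Simple Tree Matching (an assumed variant).

    Roots must share a tag; children are aligned in order by dynamic
    programming. Each matched pair at depth d adds decay**d to the score.
    """
    if a.tag != b.tag:
        return 0.0
    m, n = len(a.children), len(b.children)
    # dp[i][j]: best score aligning the first i children of a
    # with the first j children of b, preserving sibling order.
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = weighted_match(a.children[i - 1], b.children[j - 1],
                                  depth + 1, decay)
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1] + pair)
    return decay ** depth + dp[m][n]


def weighted_size(t, depth=0, decay=0.5):
    """Total weight of a tree under the same depth weighting."""
    return decay ** depth + sum(weighted_size(c, depth + 1, decay)
                                for c in t.children)


def similarity(a, b, decay=0.5):
    """Normalise the match score to [0, 1]; identical trees score 1.0."""
    total = weighted_size(a, decay=decay) + weighted_size(b, decay=decay)
    return 2 * weighted_match(a, b, decay=decay) / total


def cluster(trees, threshold=0.7):
    """Greedy clustering: join the first cluster whose representative
    (its first member) is similar enough, otherwise start a new cluster."""
    clusters = []
    for t in trees:
        for c in clusters:
            if similarity(t, c[0]) >= threshold:
                c.append(t)
                break
        else:
            clusters.append([t])
    return clusters


# Two blog-like entries with slightly different structure, plus a navigation list.
rec1 = Node("div", [Node("h2"), Node("p")])
rec2 = Node("div", [Node("h2"), Node("p"), Node("p")])
nav = Node("ul", [Node("li"), Node("li")])
groups = cluster([rec1, nav, rec2])
```

In this sketch the two entries land in the same cluster even though one has an extra paragraph, while the navigation list is kept apart. Weighting shallow nodes more heavily is one plausible way to tolerate variation deep inside loosely-structured records; the weighting actually used in the paper may differ.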