AutoFeed: an unsupervised learning system for generating webfeeds

  • Authors:
  • Bora Gazen; Steven Minton

  • Affiliations:
  • Fetch Technologies, El Segundo, CA; Fetch Technologies, El Segundo, CA

  • Venue:
  • Proceedings of the 3rd International Conference on Knowledge Capture
  • Year:
  • 2005

Abstract

Our goal is to automatically extract data from semi-structured web sites. Previously, researchers have developed two types of supervised learning approaches for extracting web data: methods that create precise, site-specific extraction rules and methods that learn less precise, site-independent extraction rules. In either case, significant training is required. In this paper, we describe a third, more ambitious approach, in which we use unsupervised learning to analyze sites and discover their structure. Our method relies on a set of heterogeneous "experts", each of which is capable of identifying certain types of generic structure. Each expert represents its discoveries as "hints". Based on these hints, our system clusters the pages and identifies semi-structured data that can be extracted. To identify a good clustering, we use a probabilistic model of the hint-generation process. The paper describes our formulation of the fully automatic web-extraction problem, our clustering approach, and our results on a set of experiments.
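
The idea of experts, hints, and likelihood-based selection of a clustering can be illustrated with a minimal sketch. It is not the model from the paper, whose details the abstract does not give. It assumes that a hint is simply a group of pages an expert believes are alike, and that each expert flags a pair of pages with probability p_same when the pair shares a cluster and p_diff otherwise. The toy pages, the two experts, the example.com URLs, and both probabilities are hypothetical.

```python
import math
from itertools import combinations

# Hypothetical toy site: page id -> features an expert might inspect.
pages = {
    "p1": {"url": "http://example.com/product/1", "layout": "title|price|desc"},
    "p2": {"url": "http://example.com/product/2", "layout": "title|price|desc"},
    "p3": {"url": "http://example.com/list/a", "layout": "row|row|row"},
    "p4": {"url": "http://example.com/list/b", "layout": "row|row|row"},
}

def group_by(key_fn):
    """Build a toy expert that hints 'these pages look alike' per shared key."""
    def expert(site):
        groups = {}
        for page_id, feats in site.items():
            groups.setdefault(key_fn(feats), set()).add(page_id)
        # Only groups with at least two pages carry any information.
        return [g for g in groups.values() if len(g) > 1]
    return expert

def first_path_segment(feats):
    # "http://host/seg/rest" -> "seg"
    rest = feats["url"].split("://", 1)[-1]
    parts = rest.split("/")
    return parts[1] if len(parts) > 1 else ""

# Two heterogeneous experts (assumed for illustration): URL shape and layout.
experts = [group_by(first_path_segment), group_by(lambda f: f["layout"])]

def hinted_pairs(hints):
    """Flatten a list of page groups into a set of unordered page pairs."""
    return {frozenset(p) for g in hints for p in combinations(sorted(g), 2)}

def log_likelihood(clustering, all_hints, p_same=0.8, p_diff=0.1):
    """Log-probability of the observed hints under the assumed pair model."""
    total = 0.0
    for hints in all_hints:
        pairs = hinted_pairs(hints)
        for a, b in combinations(sorted(clustering), 2):
            p = p_same if clustering[a] == clustering[b] else p_diff
            # Hint present contributes log p; hint absent contributes log(1 - p).
            total += math.log(p if frozenset((a, b)) in pairs else 1.0 - p)
    return total

all_hints = [expert(pages) for expert in experts]

# Candidate clusterings: page id -> cluster label.
candidates = {
    "by_type": {"p1": 0, "p2": 0, "p3": 1, "p4": 1},
    "one_cluster": {"p1": 0, "p2": 0, "p3": 0, "p4": 0},
    "singletons": {"p1": 0, "p2": 1, "p3": 2, "p4": 3},
}
scores = {name: log_likelihood(c, all_hints) for name, c in candidates.items()}
best = max(scores, key=scores.get)
```

In this toy setting both experts agree, so the clustering that separates product pages from list pages scores highest. A real system would need to search the space of clusterings rather than score a fixed list of candidates.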