Adapting Web information extraction knowledge via mining site-invariant and site-dependent features

  • Authors: Tak-Lam Wong; Wai Lam

  • Affiliations: City University of Hong Kong, Kowloon, Hong Kong; The Chinese University of Hong Kong, Shatin, Hong Kong

  • Venue: ACM Transactions on Internet Technology (TOIT)
  • Year: 2007

Abstract

We develop a novel framework that aims to automatically adapt previously learned information extraction knowledge from a source Web site to a new, unseen target site in the same domain. We investigate two kinds of features related to the text fragments in Web documents. The first kind is called a site-invariant feature. These features are likely to remain unchanged across Web pages from different sites in the same domain. The second kind is called a site-dependent feature. These features differ between Web pages collected from different Web sites, but are similar among Web pages originating from the same site. In our framework, we derive the site-invariant features from the previously learned extraction knowledge and from the items previously collected or extracted from the source Web site. The derived site-invariant features are then exploited to automatically seek a new set of training examples in the unseen target site. Both the site-dependent and the site-invariant features of these automatically discovered training examples are considered when learning new information extraction knowledge for the target site. We conducted extensive experiments on a set of real-world Web sites collected from three different domains to demonstrate the performance of our framework. For example, given training examples from just one online book catalog Web site, our approach can automatically extract information from ten different book catalog sites, achieving an average precision and recall of 71.9% and 84.0%, respectively, without any further manual intervention.
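The adaptation idea in the abstract can be illustrated with a toy sketch. This is not the authors' framework: it assumes, purely for illustration, that the site-invariant feature is token overlap (Jaccard similarity) with items already collected from the source site, and that the site-dependent feature is the tag path under which a text fragment appears on the target site. The function names, the tag paths, the threshold, and the sample data are all hypothetical.

```python
from collections import Counter


def invariant_score(text, known_items):
    """Assumed site-invariant feature: highest Jaccard token overlap
    between a fragment and any item collected from the source site."""
    tokens = set(text.lower().split())
    best = 0.0
    for item in known_items:
        item_tokens = set(item.lower().split())
        union = tokens | item_tokens
        if union:
            best = max(best, len(tokens & item_tokens) / len(union))
    return best


def adapt(target_fragments, known_items, threshold=0.5):
    """Find target-site fragments that look like known items, then return
    the tag path (assumed site-dependent feature) they most often share."""
    votes = Counter(
        path
        for path, text in target_fragments
        if invariant_score(text, known_items) >= threshold
    )
    return votes.most_common(1)[0][0] if votes else None


def extract(target_fragments, learned_path):
    """Apply the learned site-dependent feature to every fragment."""
    return [text for path, text in target_fragments if path == learned_path]


# Hypothetical items previously collected from the source site.
known_items = ["Introduction to Algorithms", "The Art of Computer Programming"]

# Hypothetical (tag_path, text) fragments from an unseen target site.
fragments = [
    ("div.nav/a", "Home"),
    ("td.title/b", "Introduction to Algorithms"),
    ("td.price", "$89.00"),
    ("td.title/b", "Compilers Principles Techniques and Tools"),
    ("td.price", "$75.50"),
    ("td.title/b", "The Art of Computer Programming"),
]

path = adapt(fragments, known_items)
titles = extract(fragments, path)
print(path)
print(titles)
```

In this sketch, two fragments on the target site match known items, which is enough to learn the tag path for titles there. That path then also picks up a title that never appeared on the source site, which is the role the abstract assigns to site-dependent features.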