A simple approach to the design of site-level extractors using domain-centric principles

  • Authors:
  • Chong Long;Xiubo Geng;Chang Xu;Sathiya Keerthi

  • Affiliations:
  • Yahoo!, Beijing, China;Yahoo!, Beijing, China;Chinese Academy of Sciences, Beijing, China;Microsoft, Redmond, WA, USA

  • Venue:
  • Proceedings of the 21st ACM international conference on Information and knowledge management
  • Year:
  • 2012

Quantified Score

Hi-index 0.00

Visualization

Abstract

We consider the problem of extracting, in a domain-centric fashion, a given set of attributes from a large number of semi-structured websites. Previous approaches [7, 5] to solve this problem are based on page level inference. We propose a distinct new approach that directly chooses attribute extractors for a site using a scoring mechanism that is designed at the domain level via simple classification methods using a training set from a small number of sites. To keep the number of candidate extractors in each site manageably small we use two observations that hold in most domains: (a) imprecise annotators can be used to identify a small set of candidate extractors for a few attributes (anchors); and (b) non-anchor attributes lie in close proximity to the anchor attributes. Experiments on three domains (Events, Books and Restaurants) show that our approach is very effective in spite of its simplicity.