A simple approach to the design of site-level extractors using domain-centric principles

Authors:
Chong Long;Xiubo Geng;Chang Xu;Sathiya Keerthi
Affiliations:
Yahoo!, Beijing, China;Yahoo!, Beijing, China;Chinese Academy of Sciences, Beijing, China;Microsoft, Redmond, WA, USA
Venue:
Proceedings of the 21st ACM international conference on Information and knowledge management
Year:
2012

References (7):
Simultaneous record detection and attribute labeling in web data extraction. Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
Online Passive-Aggressive Algorithms. The Journal of Machine Learning Research
Information Extraction. Foundations and Trends in Databases
Incorporating site-level knowledge to extract structured data from web forums. Proceedings of the 18th international conference on World wide web
Using clustering and edit distance techniques for automatic web data extraction. WISE'07 Proceedings of the 8th international conference on Web information systems engineering
Automatic wrappers for large scale web extraction. Proceedings of the VLDB Endowment
Web information extraction using markov logic networks. Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining

Abstract

We consider the problem of extracting, in a domain-centric fashion, a given set of attributes from a large number of semi-structured websites. Previous approaches [7, 5] to this problem are based on page-level inference. We propose a new approach that directly chooses attribute extractors for a site using a scoring mechanism designed at the domain level via simple classification methods, trained on data from a small number of sites. To keep the number of candidate extractors in each site manageably small, we use two observations that hold in most domains: (a) imprecise annotators can be used to identify a small set of candidate extractors for a few attributes (anchors); and (b) non-anchor attributes lie in close proximity to the anchor attributes. Experiments on three domains (Events, Books and Restaurants) show that our approach is effective despite its simplicity.
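
The idea of scoring candidate extractors per site, rather than labeling each page, can be illustrated with a minimal sketch. This is not the authors' implementation: the `Candidate` structure, the feature names, the linear scorer, and the position-window notion of proximity are all assumptions made for illustration, and the weights stand in for whatever a classifier trained on a few labeled sites would produce.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Candidate:
    """A candidate extractor for one attribute on one site (hypothetical structure)."""
    xpath: str                  # rule that selects nodes on the site's pages
    features: Dict[str, float]  # site-level features of what the rule extracts


def score(candidate: Candidate, weights: Dict[str, float]) -> float:
    """Linear score; the weights stand in for a domain-level classifier."""
    return sum(weights.get(name, 0.0) * value
               for name, value in candidate.features.items())


def choose_extractor(candidates: List[Candidate],
                     weights: Dict[str, float]) -> Candidate:
    """Pick the highest-scoring candidate extractor for a site."""
    if not candidates:
        raise ValueError("no candidate extractors for this site")
    return max(candidates, key=lambda c: score(c, weights))


def near_anchor(anchor_pos: int, node_positions: List[int],
                window: int) -> List[int]:
    """Keep only nodes within `window` positions of the anchor node
    (an assumed, simplified notion of proximity)."""
    return [p for p in node_positions if abs(p - anchor_pos) <= window]
```

In this sketch, an imprecise annotator would supply a feature such as the fraction of extracted values it recognizes, `choose_extractor` would select the anchor extractor for the site, and `near_anchor` would restrict the search for non-anchor attributes to nodes close to the chosen anchor.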