This paper presents a Bayesian learning framework for adapting information extraction wrappers with new attribute discovery, reducing the human effort needed to extract precise information from unseen Web sites. Our approach automatically adapts the information extraction knowledge previously learned from a source Web site to a new, unseen site while also discovering previously unseen attributes. Two kinds of text-related clues from the source Web site are considered. The first is obtained from the extraction pattern contained in the previously learned wrapper. The second is derived from the previously extracted or collected items. To handle the uncertainty involved, we design a generative model of the site-independent content information and the site-dependent layout format of the text fragments related to attribute values in a Web page. Under this generative model, Bayesian learning and expectation-maximization (EM) techniques are developed to identify new training data for learning the wrapper for the new unseen site. Previously unseen attributes, together with their semantic labels, can also be discovered via a second EM-based Bayesian learning procedure built on the same generative model. We have conducted extensive experiments on more than 30 real-world Web sites in three different domains to demonstrate the effectiveness of our framework.
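The general idea of using EM to pick out new training data on an unseen site can be illustrated with a small sketch. This is not the paper's generative model: it swaps in a plain two-class naive Bayes mixture over content tokens only, ignores layout format entirely, and seeds the model with a few items assumed to have been extracted from the source site. The function names (`m_step`, `e_step`, `em_label_fragments`), the tokenization, and the toy fragments are all hypothetical.

```python
import math
from collections import defaultdict


def m_step(weighted_fragments, vocab, alpha=1.0):
    """Estimate log priors and smoothed log token probabilities for two classes.

    weighted_fragments holds (tokens, p_target) pairs. Class 1 is "value of the
    target attribute", class 0 is "anything else". alpha is add-alpha smoothing.
    """
    token_counts = [defaultdict(float), defaultdict(float)]
    class_mass = [alpha, alpha]
    for tokens, p_target in weighted_fragments:
        for cls, weight in ((0, 1.0 - p_target), (1, p_target)):
            class_mass[cls] += weight
            for tok in tokens:
                token_counts[cls][tok] += weight
    total_mass = sum(class_mass)
    log_prior = [math.log(mass / total_mass) for mass in class_mass]
    log_like = []
    for cls in (0, 1):
        denom = sum(token_counts[cls].values()) + alpha * len(vocab)
        log_like.append(
            {tok: math.log((token_counts[cls].get(tok, 0.0) + alpha) / denom)
             for tok in vocab}
        )
    return log_prior, log_like


def e_step(fragments, log_prior, log_like):
    """Return P(class 1 | tokens) for every unlabeled fragment."""
    posteriors = []
    for tokens in fragments:
        scores = [log_prior[cls] + sum(log_like[cls][tok] for tok in tokens)
                  for cls in (0, 1)]
        top = max(scores)  # subtract the max before exponentiating for stability
        exp0 = math.exp(scores[0] - top)
        exp1 = math.exp(scores[1] - top)
        posteriors.append(exp1 / (exp0 + exp1))
    return posteriors


def em_label_fragments(seed_target, seed_other, fragments, iterations=20):
    """Semi-supervised EM: seeds keep fixed labels, fragments get soft labels."""
    seeds = [(f, 1.0) for f in seed_target] + [(f, 0.0) for f in seed_other]
    vocab = {tok for f in seed_target + seed_other + fragments for tok in f}
    params = m_step(seeds, vocab)  # initialize from the source-site seeds only
    for _ in range(iterations):
        posteriors = e_step(fragments, *params)
        params = m_step(seeds + list(zip(fragments, posteriors)), vocab)
    return e_step(fragments, *params)


# Hypothetical seeds from a source site and fragments from a new site.
seed_target = [["price", "$", "19.99"], ["price", "$", "5.00"]]
seed_other = [["copyright", "2010", "all", "rights"], ["home", "about", "contact"]]
fragments = [
    ["our", "price", "$", "24.50"],
    ["contact", "us", "about", "shipping"],
    ["list", "price", "$", "30.00"],
]
probs = em_label_fragments(seed_target, seed_other, fragments)
for tokens, p in zip(fragments, probs):
    print(" ".join(tokens), round(p, 3))
```

Fragments with a high posterior could then serve as candidate training examples for a wrapper on the new site. A fuller treatment would also need to model site-dependent layout, which this sketch deliberately leaves out.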