We develop a probabilistic framework for adapting information extraction wrappers with new attribute discovery. Wrapper adaptation aims to automatically adapt a wrapper previously learned from a source Web site to a new, unseen site for information extraction. One unique characteristic of our framework is that it can discover new or previously unseen attributes, as well as their headers, from the new site. The framework is based on a generative model for the text fragments related to attribute items and formatting data in a Web page. To solve the wrapper adaptation problem, we consider two kinds of information from the source Web site. The first is the extraction knowledge contained in the previously learned wrapper. The second is the set of previously extracted or collected items. We employ a Bayesian learning approach to automatically select a set of training examples for adapting a wrapper to the new unseen site. To solve the new attribute discovery problem, we develop a model that analyzes the text fragments surrounding the attributes in the new unseen site. A Bayesian learning method is developed to discover the new attributes and their headers. The expectation-maximization (EM) technique is employed in both Bayesian learning models. We conducted extensive experiments on a number of real-world Web sites to demonstrate the effectiveness of our framework.
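As an illustration of how EM can combine labeled fragments from a source site with unlabeled fragments from a new site, the following is a minimal sketch and not the model used in this work. It assumes a much simpler generative model, a two-class multinomial naive Bayes mixture over tokens (class 1 for attribute-related fragments, class 0 for other text). The function names `posterior` and `train_em`, the smoothing parameter `alpha`, and the toy fragments are all hypothetical.

```python
import math
from collections import Counter


def posterior(tokens, prior, word_probs):
    """Return [P(class 0 | tokens), P(class 1 | tokens)] under naive Bayes."""
    logs = []
    for k in (0, 1):
        lp = math.log(prior[k])
        for t in tokens:
            if t in word_probs[k]:  # tokens outside the vocabulary are skipped
                lp += math.log(word_probs[k][t])
        logs.append(lp)
    m = max(logs)  # subtract the max before exponentiating, for stability
    exps = [math.exp(lp - m) for lp in logs]
    z = sum(exps)
    return [e / z for e in exps]


def train_em(labeled, unlabeled, n_iter=20, alpha=1.0):
    """Semi-supervised EM for a two-class multinomial mixture (an assumption).

    labeled:   list of (tokens, label) pairs with label in {0, 1}
    unlabeled: list of token lists
    Returns (prior, word_probs, resp), where resp holds the final class
    posteriors of the unlabeled fragments.
    """
    vocab = sorted({t for toks, _ in labeled for t in toks}
                   | {t for toks in unlabeled for t in toks})
    # Start with maximal uncertainty about every unlabeled fragment.
    resp = [[0.5, 0.5] for _ in unlabeled]
    prior, word_probs = [0.5, 0.5], [{}, {}]
    for _ in range(n_iter):
        # M-step: re-estimate parameters from hard (labeled) and
        # soft (unlabeled) counts, with additive smoothing alpha.
        class_mass = [alpha, alpha]
        counts = [Counter(), Counter()]
        for toks, y in labeled:
            class_mass[y] += 1.0
            for t in toks:
                counts[y][t] += 1.0
        for toks, r in zip(unlabeled, resp):
            for k in (0, 1):
                class_mass[k] += r[k]
                for t in toks:
                    counts[k][t] += r[k]
        total = sum(class_mass)
        prior = [m / total for m in class_mass]
        word_probs = []
        for k in (0, 1):
            denom = sum(counts[k].values()) + alpha * len(vocab)
            word_probs.append({t: (counts[k][t] + alpha) / denom for t in vocab})
        # E-step: recompute class posteriors for the unlabeled fragments.
        resp = [posterior(toks, prior, word_probs) for toks in unlabeled]
    return prior, word_probs, resp


# Toy data (hypothetical): fragments labeled on the source site ...
LABELED = [
    (["price", "usd"], 1),
    (["shipping", "usd"], 1),
    (["copyright", "reserved"], 0),
    (["login", "help"], 0),
]
# ... and unlabeled fragments from the new, unseen site.
UNLABELED = [
    ["cost", "usd"],
    ["cost", "eur"],
    ["help", "sitemap"],
    ["sitemap", "reserved"],
]

if __name__ == "__main__":
    prior, word_probs, resp = train_em(LABELED, UNLABELED)
    for toks, r in zip(UNLABELED, resp):
        print(toks, "P(attribute) = %.3f" % r[1])
```

In this toy setting the token "cost" never appears in the labeled data, yet it gains probability mass in the attribute class because it co-occurs with "usd" in the new site's fragments. This is the kind of evidence propagation that makes EM attractive when labeled examples for the new site are unavailable.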