A Probabilistic Approach for Adapting Information Extraction Wrappers and Discovering New Attributes

Authors:
Tak-Lam Wong; Wai Lam
Affiliations:
The Chinese University of Hong Kong, Shatin; The Chinese University of Hong Kong, Shatin
Venue:
ICDM '04 Proceedings of the Fourth IEEE International Conference on Data Mining
Year:
2004

Citing 0
Cited 9

Hot Item Mining and Summarization from Multiple Auction Web Sites
ICDM '05 Proceedings of the Fifth IEEE International Conference on Data Mining

Adapting Web information extraction knowledge via mining site-invariant and site-dependent features
ACM Transactions on Internet Technology (TOIT)

Automatic Domain Ontology Generation from Web Sites
Journal of Integrated Design & Process Science

An unsupervised method for joint information extraction and feature mining across different Web sites
Data & Knowledge Engineering

Attribute retrieval from relational web tables
SPIRE'11 Proceedings of the 18th international conference on String processing and information retrieval

Towards a framework for attribute retrieval
Proceedings of the 20th ACM international conference on Information and knowledge management

Extracting and summarizing hot item features across different auction web sites
PAKDD'06 Proceedings of the 10th Pacific-Asia conference on Advances in Knowledge Discovery and Data Mining

Training conditional random fields with unlabeled data and limited number of labeled examples
ICMLC'05 Proceedings of the 4th international conference on Advances in Machine Learning and Cybernetics

Aggregated search: A new information retrieval paradigm
ACM Computing Surveys (CSUR)

Abstract

We develop a probabilistic framework for adapting information extraction wrappers and discovering new attributes. Wrapper adaptation aims to automatically adapt a wrapper previously learned from a source Web site to a new, unseen site. A distinctive characteristic of our framework is that it can discover new, previously unseen attributes, together with their headers, on the new site. The framework is based on a generative model of the text fragments related to attribute items and formatting data in a Web page. To solve the wrapper adaptation problem, we consider two kinds of information from the source Web site: the extraction knowledge contained in the previously learned wrapper, and the items previously extracted or collected from that site. We employ a Bayesian learning approach to automatically select a set of training examples for adapting a wrapper to the new site. To solve the new attribute discovery problem, we develop a model that analyzes the text fragments surrounding the attributes in the new site, and a Bayesian learning method that discovers the new attributes and their headers. The EM technique is employed in both Bayesian learning models. We conducted extensive experiments on a number of real-world Web sites to demonstrate the effectiveness of our framework.
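
The abstract names EM as the estimation technique but does not spell out the generative model, so the following is only a generic illustration of EM, not the authors' model. It is a minimal sketch that fits a two-component mixture of independent Bernoulli features to a handful of text fragments. The feature names, the toy fragments, the seed probabilities, and the function name `em_bernoulli_mixture` are all assumptions made for this example.

```python
import math


def em_bernoulli_mixture(rows, init_means, iters=50, smooth=1e-3):
    """EM for a mixture of independent Bernoulli features (illustrative only).

    rows: binary feature vectors, one per text fragment.
    init_means: seed feature probabilities per component, strictly in (0, 1).
    Returns (priors, means, responsibilities).
    """
    k, d, n = len(init_means), len(rows[0]), len(rows)
    priors = [1.0 / k] * k
    means = [list(m) for m in init_means]
    resp = []
    for _ in range(iters):
        # E-step: posterior probability of each component for each fragment,
        # computed in log space and normalised for numerical stability.
        resp = []
        for x in rows:
            logp = []
            for c in range(k):
                lp = math.log(priors[c])
                for j in range(d):
                    p = means[c][j]
                    lp += math.log(p if x[j] else 1.0 - p)
                logp.append(lp)
            top = max(logp)
            weights = [math.exp(v - top) for v in logp]
            total = sum(weights)
            resp.append([w / total for w in weights])
        # M-step: re-estimate priors and feature probabilities, with a small
        # smoothing term so that no probability reaches exactly 0 or 1.
        for c in range(k):
            mass = sum(r[c] for r in resp)
            priors[c] = (mass + smooth) / (n + k * smooth)
            for j in range(d):
                hit = sum(r[c] for r, x in zip(resp, rows) if x[j])
                means[c][j] = (hit + smooth) / (mass + 2 * smooth)
    return priors, means, resp


# Hypothetical binary features: [contains_digit, has_currency_symbol, is_bold]
fragments = [
    [1, 1, 0], [1, 1, 0], [1, 1, 1],   # price-like fragments
    [0, 0, 1], [0, 0, 1], [0, 0, 0],   # other fragments
]
# Hypothetical seeds: component 0 leans toward price-like features.
seeds = [[0.7, 0.7, 0.3], [0.3, 0.3, 0.7]]

priors, means, resp = em_bernoulli_mixture(fragments, seeds)
labels = [max(range(len(seeds)), key=lambda c: r[c]) for r in resp]
```

Here `labels` holds the most probable component for each fragment. In a wrapper adaptation setting, the seeds would plausibly come from knowledge learned on the source site, but how the paper actually does this is not stated in the abstract.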