An unsupervised method for joint information extraction and feature mining across different Web sites

Authors:
Tak-Lam Wong; Wai Lam
Affiliations:
Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, Hong Kong; Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Shatin, Hong Kong
Venue:
Data & Knowledge Engineering
Year:
2009

Abstract

We develop an unsupervised learning framework that jointly extracts information and mines features from a set of Web pages across different sites. The model allows tight interaction between the two tasks, so decisions for both are made coherently, yielding solutions that satisfy both tasks and avoid conflicts between them. Our approach is based on an undirected graphical model that captures the interdependence between text fragments within the same Web page as well as between text fragments in different Web pages. Because Web pages from different sites are considered simultaneously, information from different sources can be leveraged effectively. An approximate learning algorithm is developed to conduct inference over the graphical model and thereby tackle the information extraction and feature mining tasks. We apply the framework to two applications: mining important product features from vendor sites and mining hot item features from auction sites. Extensive experiments on real-world data demonstrate its effectiveness.
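
The abstract does not spell out the model's potentials or its inference procedure. Purely as an illustration of what approximate inference over an undirected graphical model of text fragments can look like, the sketch below runs sum-product loopy belief propagation on a toy pairwise model. Everything in it is an assumption made for the example and not taken from the paper: the fragment names, the binary label (1 = "is a product feature"), the unary scores, and the `agree` weight that rewards linked fragments for sharing a label.

```python
from math import prod


def loopy_bp(unary, edges, agree=2.0, iters=20):
    """Sum-product loopy belief propagation on a binary pairwise model.

    unary: dict mapping node -> [score for label 0, score for label 1]
    edges: list of undirected (u, v) pairs linking related fragments
    agree: pairwise potential when two linked labels match (1.0 otherwise)
    """
    pair = [[agree, 1.0], [1.0, agree]]
    neighbors = {n: [] for n in unary}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)

    # msgs[(u, v)][x] is the message from u to v about v taking label x.
    msgs = {(u, v): [1.0, 1.0] for u in unary for v in neighbors[u]}
    for _ in range(iters):
        new = {}
        for (u, v) in msgs:
            out = []
            for xv in (0, 1):
                total = 0.0
                for xu in (0, 1):
                    # Product of messages into u from all neighbors except v.
                    incoming = prod(
                        msgs[(w, u)][xu] for w in neighbors[u] if w != v
                    )
                    total += unary[u][xu] * pair[xu][xv] * incoming
                out.append(total)
            z = sum(out)
            new[(u, v)] = [m / z for m in out]  # normalize for stability
        msgs = new

    # Belief = local evidence times all incoming messages, normalized.
    beliefs = {}
    for n in unary:
        b = [
            unary[n][x] * prod(msgs[(w, n)][x] for w in neighbors[n])
            for x in (0, 1)
        ]
        z = sum(b)
        beliefs[n] = [p / z for p in b]
    return beliefs


# Hypothetical fragments from two sites; only one has strong local evidence.
unary = {
    "siteA:battery": [0.2, 0.8],
    "siteA:price": [0.5, 0.5],
    "siteB:battery": [0.5, 0.5],
}
edges = [
    ("siteA:battery", "siteA:price"),    # same page
    ("siteA:battery", "siteB:battery"),  # similar fragments on different sites
]
beliefs = loopy_bp(unary, edges)
for name, b in beliefs.items():
    print(f"{name}: P(feature) = {b[1]:.3f}")
```

In this toy graph the cross-site edge lets evidence from one site raise the belief for an uninformative fragment on another, which is the kind of leverage the abstract describes. The graph here is a chain, so belief propagation is exact; on graphs with cycles the same updates give only approximate marginals.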