Normalizing web product attributes and discovering domain ontology with minimal effort

Authors:
Tak-Lam Wong;Lidong Bing;Wai Lam
Affiliations:
The Chinese University of Hong Kong, Hong Kong, Hong Kong;The Chinese University of Hong Kong, Hong Kong, Hong Kong;The Chinese University of Hong Kong, Hong Kong, Hong Kong
Venue:
Proceedings of the fourth ACM international conference on Web search and data mining
Year:
2011

Abstract

We have developed a framework that normalizes product attributes from Web pages collected from different Web sites without the need for labeled training examples. It handles pages with different layout formats and content in an unsupervised manner, and can therefore be applied to a variety of domains with minimal effort. Our framework is based on a generative probabilistic graphical model that incorporates Hidden Markov Models (HMMs) and considers both attribute names and attribute values, so that text fragments are extracted from Web pages and normalized in a unified manner. A Dirichlet Process is employed to handle the unlimited number of attributes in a domain. An unsupervised inference method is proposed to predict the unobservable variables. We have also developed a method that automatically constructs a domain ontology from the normalized product attributes produced by inference on the graphical model. We have conducted extensive experiments on product Web pages collected from real-world Web sites in three different domains, and compared our framework with existing work, to demonstrate its effectiveness.
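
To illustrate why a Dirichlet Process suits a domain whose number of attributes is not known in advance, the following minimal sketch simulates the Chinese Restaurant Process, a standard constructive view of the Dirichlet Process. It is not the authors' model or inference method. The function name `crp_assign`, the concentration parameter `alpha`, and the treatment of each text fragment as an item to be grouped are assumptions made only for illustration.

```python
import random


def crp_assign(num_items, alpha, seed=0):
    """Group items with a Chinese Restaurant Process (illustrative only).

    Each item joins existing group k with probability proportional to the
    group's current size, or opens a new group with probability
    proportional to alpha. alpha must be positive.
    """
    rng = random.Random(seed)
    counts = []        # counts[k] = number of items currently in group k
    assignments = []   # assignments[i] = group index chosen for item i
    for _ in range(num_items):
        # The last weight is the chance of creating a brand-new group.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)   # a new group (a hypothetical new attribute)
        else:
            counts[k] += 1     # an existing group grows
        assignments.append(k)
    return assignments, counts


if __name__ == "__main__":
    groups, sizes = crp_assign(num_items=50, alpha=1.0)
    print("number of groups:", len(sizes))
    print("group sizes:", sizes)
```

The number of groups is never fixed beforehand: it grows with the data, and a larger `alpha` tends to produce more groups. This is the property that lets a Dirichlet Process prior accommodate an open-ended set of attributes.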