An unsupervised framework for extracting and normalizing product attributes from multiple web sites

Authors:
Tak-Lam Wong;Wai Lam;Tik-Shun Wong
Affiliations:
The Chinese University of Hong Kong, Hong Kong, Hong Kong;The Chinese University of Hong Kong, Hong Kong, Hong Kong;The Chinese University of Hong Kong, Hong Kong, Hong Kong
Venue:
Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Year:
2008

Citing 13
Cited 12

Hierarchical Wrapper Induction for Semistructured Information Sources

Autonomous Agents and Multi-Agent Systems
Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data

ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning
RoadRunner: Towards Automatic Data Extraction from Large Web Sites

Proceedings of the 27th International Conference on Very Large Data Bases
Adaptive duplicate detection using learnable string similarity measures

Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Dynamic conditional random fields: factorized probabilistic models for labeling and segmenting sequence data

ICML '04 Proceedings of the twenty-first international conference on Machine learning
An integrated, conditional model of information extraction and coreference with application to citation matching

UAI '04 Proceedings of the 20th conference on Uncertainty in artificial intelligence
Adaptive information extraction

ACM Computing Surveys (CSUR)
Adapting Web information extraction knowledge via mining site-invariant and site-dependent features

ACM Transactions on Internet Technology (TOIT)
Entity Resolution with Markov Logic

ICDM '06 Proceedings of the Sixth International Conference on Data Mining
Mining templates from search result records of search engines

Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
Webpage understanding: an integrated approach

Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
Context-aware wrapping: synchronized data extraction

VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Semi-supervised learning of attribute-value pairs from product descriptions

IJCAI'07 Proceedings of the 20th international joint conference on Artifical intelligence

Extracting structured information from user queries with semi-supervised conditional random fields

Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
An unsupervised approach for product record normalization across different web sites

AAAI'08 Proceedings of the 23rd national conference on Artificial intelligence - Volume 2
Simultaneous Product Attribute Name and Value Extraction from Web Pages

WI-IAT '09 Proceedings of the 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology - Volume 03
Product feature categorization with multilevel latent semantic association

Proceedings of the 18th ACM conference on Information and knowledge management
OpinionIt: a text mining system for cross-lingual opinion analysis

CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
Normalizing web product attributes and discovering domain ontology with minimal effort

Proceedings of the fourth ACM international conference on Web search and data mining
Extracting and ranking product features in opinion documents

COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics: Posters
Opinion word expansion and target extraction through double propagation

Computational Linguistics
ILDA: interdependent LDA model for learning latent aspects and their ratings from online product reviews

Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
Joint sentiment aspect model

Proceedings of the 27th Annual ACM Symposium on Applied Computing
CharaParser for fine-grained semantic annotation of organism morphological descriptions

Journal of the American Society for Information Science and Technology
A dynamic learning framework to thoroughly extract structured data from web pages without human efforts

Proceedings of the ACM SIGKDD Workshop on Mining Data Semantics

Quantified Score

Hi-index	0.00

Visualization

Abstract

We have developed an unsupervised framework for simultaneously extracting and normalizing attributes of products from multiple Web pages originated from different sites. Our framework is designed based on a probabilistic graphical model that can model the page-independent content information and the page-dependent layout information of the text fragments in Web pages. One characteristic of our framework is that previously unseen attributes can be discovered from the clue contained in the layout format of the text fragments. Our framework tackles both extraction and normalization tasks by jointly considering the relationship between the content and layout information. Dirichlet process prior is employed leading to another advantage that the number of discovered product attributes is unlimited. An unsupervised inference algorithm based on variational method is presented. The semantics of the normalized attributes can be visualized by examining the term weights in the model. Our framework can be applied to a wide range of Web mining applications such as product matching and retrieval. We have conducted extensive experiments from four different domains consisting of over 300 Web pages from over 150 different Web sites, demonstrating the robustness and effectiveness of our framework.