An unsupervised approach for product record normalization across different web sites

Authors:
Tak-Lam Wong;Tik-Shun Wong;Wai Lam
Affiliations:
Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong;Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Hong Kong;Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Hong Kong
Venue:
AAAI'08 Proceedings of the 23rd national conference on Artificial intelligence - Volume 2
Year:
2008

Citing 13
Cited 0

Efficient clustering of high-dimensional data sets with application to reference matching

Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining
Interactive deduplication using active learning

Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Adaptive duplicate detection using learnable string similarity measures

Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
A hierarchical graphical model for record linkage

UAI '04 Proceedings of the 20th conference on Uncertainty in artificial intelligence
Learning to extract information from semi-structured text using a discriminative context free grammar

Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
Adaptive Product Normalization: Using Online Learning for Record Linkage in Comparison Shopping

ICDM '05 Proceedings of the Fifth IEEE International Conference on Data Mining
Adaptive information extraction

ACM Computing Surveys (CSUR)
A Survey of Web Information Extraction Systems

IEEE Transactions on Knowledge and Data Engineering
Adapting Web information extraction knowledge via mining site-invariant and site-dependent features

ACM Transactions on Internet Technology (TOIT)
Entity Resolution with Markov Logic

ICDM '06 Proceedings of the Sixth International Conference on Data Mining
Canonicalization of database records using adaptive similarity measures

Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
Mining templates from search result records of search engines

Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
An unsupervised framework for extracting and normalizing product attributes from multiple web sites

Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval

Quantified Score

Hi-index	0.00

Visualization

Abstract

An unsupervised probabilistic learning framework for normalizing product records across different retailer Web sites is presented. Our framework decomposes the problem into two tasks to achieve the goal. The first task aims at extracting attribute values of products from different sites and normalizing them into appropriate reference attributes. This task is challenging because the set of reference attributes is unknown in advance. Besides, the layout formats are different in different Web sites. The second task is to conduct product record normalization aiming at identifying product records referring to the same reference product based on the results of the first task. We develop a graphical model for the generation of text fragments in Web pages to accomplish the two tasks. One characteristic of our model is that the product attributes to be extracted are not required to be specified in advance and an unlimited number of previously unseen product attributes can be handled. We compare our framework with existing methods. Extensive experiments using over 300 Web pages from over 150 real-world Web sites from three different domains have been conducted demonstrating the effectiveness of our framework.