Normalizing web product attributes and discovering domain ontology with minimal effort

  • Authors: Tak-Lam Wong; Lidong Bing; Wai Lam

  • Affiliation (all authors): The Chinese University of Hong Kong, Hong Kong

  • Venue: Proceedings of the fourth ACM international conference on Web search and data mining
  • Year: 2011

Abstract

We have developed a framework for normalizing product attributes from Web pages collected from different Web sites, without the need for labeled training examples. It can deal with pages that differ in layout format and content in an unsupervised manner. As a result, it can handle a variety of domains with minimal effort. Our model is a generative probabilistic graphical model that incorporates Hidden Markov Models (HMMs) and considers both attribute names and attribute values, so that text fragments are extracted from Web pages and normalized in a unified manner. A Dirichlet process is employed to handle the unlimited number of attributes in a domain. An unsupervised inference method is proposed to predict the unobservable variables. We have also developed a method to automatically construct a domain ontology from the normalized product attributes, which are the output of inference on the graphical model. We have conducted extensive experiments on product Web pages collected from real-world Web sites in three different domains, and compared our framework with existing work, to demonstrate its effectiveness.
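
To illustrate why a Dirichlet process suits a domain whose number of attributes is not known in advance, the sketch below samples cluster labels from a Chinese restaurant process, a standard sequential view of a Dirichlet process prior. It is only an illustration of the prior, not the authors' graphical model or inference method; the function name `crp_assignments` and the parameter `alpha` are hypothetical names chosen here.

```python
import random
from typing import List, Optional


def crp_assignments(n_items: int, alpha: float,
                    seed: Optional[int] = None) -> List[int]:
    """Sample cluster labels for n_items from a Chinese restaurant process.

    alpha (> 0) is the concentration parameter: larger values make it
    more likely that an item opens a new cluster (here, a new attribute).
    """
    rng = random.Random(seed)
    counts: List[int] = []   # counts[k] = items already assigned to cluster k
    labels: List[int] = []
    for _ in range(n_items):
        # An existing cluster k is chosen with weight counts[k];
        # a brand-new cluster is chosen with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)      # open a new cluster
        else:
            counts[k] += 1        # join an existing cluster
        labels.append(k)
    return labels


if __name__ == "__main__":
    labels = crp_assignments(n_items=20, alpha=1.0, seed=0)
    print(labels)
    print("number of clusters:", len(set(labels)))
```

The number of clusters is not fixed beforehand: it grows with the data, which is the property the abstract relies on when it says the number of attributes in a domain is unlimited.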