A simhash-based scheme for locating product information from the web

  • Authors:
  • Tuan-Anh N. Pham; Van K. Nguyen

  • Affiliations:
  • Hanoi University of Science and Technology; Hanoi University of Science and Technology

  • Venue:
  • Proceedings of the Second Symposium on Information and Communication Technology
  • Year:
  • 2011

Abstract

With the explosive growth of commercial websites and Internet-based services, efficient search services specialized for product information have become crucial. We share the observation made in PEWeb [24] that products are almost always displayed as a series of similar-looking information blocks showing features and prices for customers to choose from, so the webpage DOM tree contains similar subtrees in the parts corresponding to the product display areas. We propose using a special hash function, Simhash [18], to identify these product regions: subtrees of the webpage DOM tree with similar structures have similar Simhash fingerprints, differing in only a few bits. To eliminate possible misidentifications from this first, Simhash-based phase, we combine it with a decision tree approach. This gives us more flexibility, especially for product websites developed by Vietnamese companies, which prefer certain display formats that are not very popular worldwide. Compared to PEWeb, our scheme offers more options for adjustment and can therefore be more refined and flexible. The resulting improvement in precision is strongly supported by experimental results.
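
The fingerprint comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The feature choice (root-relative tag paths of each subtree), the use of MD5 as the per-feature hash, the 64-bit fingerprint width, and the 3-bit similarity threshold are all assumptions made for illustration, as are the function names.

```python
import hashlib
import xml.etree.ElementTree as ET

BITS = 64  # fingerprint width (assumed; the abstract does not state one)


def simhash(features):
    """Simhash over a list of string features, each with weight 1."""
    counts = [0] * BITS
    for feature in features:
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # 64-bit hash of the feature
        for i in range(BITS):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i in range(BITS):
        if counts[i] > 0:  # bit is set when the 1-votes outnumber the 0-votes
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def subtree_features(node, prefix=""):
    """Tag paths of every element in the subtree rooted at node.

    Hypothetical feature choice: only structure (tags) is used, not text.
    """
    path = prefix + "/" + node.tag
    features = [path]
    for child in node:
        features.extend(subtree_features(child, path))
    return features


def similar_pairs(fingerprints, max_bits=3):
    """Index pairs whose fingerprints differ in at most max_bits bits."""
    pairs = []
    for i in range(len(fingerprints)):
        for j in range(i + 1, len(fingerprints)):
            if hamming(fingerprints[i], fingerprints[j]) <= max_bits:
                pairs.append((i, j))
    return pairs


# A toy, well-formed page fragment with three product-like list items.
page = ET.fromstring(
    "<div>"
    "<ul>"
    "<li><img/><a>Phone A</a><span>199</span></li>"
    "<li><img/><a>Phone B</a><span>249</span></li>"
    "<li><img/><a>Phone C</a><span>299</span><em>sale</em></li>"
    "</ul>"
    "<p>About us</p>"
    "</div>"
)
items = list(page.find("ul"))
fps = [simhash(subtree_features(li)) for li in items]

for i, j in similar_pairs(fps):
    print(f"items {i} and {j} differ in {hamming(fps[i], fps[j])} bit(s)")
```

Because the features here ignore text content, the first two items produce identical fingerprints. How far the third item drifts depends on the hash and on how many features the subtrees share, so a real system would have to tune both the feature set and the bit threshold.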