Measuring similarity of chinese web databases based on category hierarchy

  • Authors:
  • Juan Liu;Ju Fan;Lizhu Zhou

  • Affiliations:
  • Tsinghua National Laboratory for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University, Beijing, China;Tsinghua National Laboratory for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University, Beijing, China;Tsinghua National Laboratory for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University, Beijing, China

  • Venue:
  • APWeb'11 Proceedings of the 13th Asia-Pacific web conference on Web technologies and applications
  • Year:
  • 2011

Quantified Score

Hi-index 0.00

Visualization

Abstract

The amount of high-quality data in the Web databases has been increasing dramatically. To utilize such wealth of information, measuring the similarity betweenWeb databases has been proposed for many applications, such as clustering and top-k recommendation. Most of the existing methods use the text information either in the interfaces of Web databases or in the Web pages where the interfaces are located, to represent the Web databases. These methods have the limitation that the text may contain a lot of noisy words, which are rarely discriminative and cannot capture the characteristics of the Web databases. To better measure the similarity between Web databases, we introduce a novel Web database similarity method. We employ the categories of the records in the Web databases, which can be automatically extracted from the Web sites where the Web databases are located, to represent the Web databases. The record categories are of high-quality and can capture the characteristics of the corresponding Web databases effectively. In order to better utilize the record categories, we measure the similarity between Web databases based on a unified category hierarchy, and propose an effective method to construct the category hierarchy from the record categories obtained from all the Web databases. We conducted experiments on real ChineseWeb Databases to evaluate our method. The results show that our method is effective in clustering and top-k recommendation for Web Databases, compared with the baseline method, and can be used in real Web database related applications.