Improving Vietnamese web page clustering by combining neighbors' content and using iterative feature selection

  • Authors:
  • Le Viet Hung;Nguyen Thi Kim Anh;Nguyen Hai Dang

  • Affiliations:
  • Hanoi University of Science and Technology;Hanoi University of Science and Technology;Hanoi University of Science and Technology

  • Venue:
  • Proceedings of the Third Symposium on Information and Communication Technology
  • Year:
  • 2012

Quantified Score

Hi-index 0.00

Visualization

Abstract

Web page clustering is a fundamental technique to offer a solution for data management, information locating and its interpretation of Web data and to facilitate users for navigation, discrimination and understanding. Most existing clustering algorithms can't adapt well to Web page clustering directly in terms of efficiency and effectiveness due to the problems of high dimensionality and data sparseness. Furthermore, the uncontrolled nature of web content presents additional challenges to web page clustering, whereas the interconnected characteristic of hypertext can provide useful information for the process. To address this problem, we propose a new Web page clustering method with combining neighbors' content to overcome data sparseness and using Iterative Feature Selection to remove noisy and redundant features and to improve the performance of clustering algorithm. Experimental results show that the proposed method significantly improves the performance of the Vietnamese web page clustering with a relatively small number of good descriptive features for web pages.