Optimized k-means clustering with intelligent initial centroid selection for web search using URL and tag contents

  • Authors:
  • S. Poomagal; T. Hamsapriya

  • Affiliations:
  • Research Scholar, PSG College of Technology, Coimbatore, Tamilnadu, India; PSG College of Technology, Coimbatore, Tamilnadu, India

  • Venue:
  • Proceedings of the International Conference on Web Intelligence, Mining and Semantics
  • Year:
  • 2011

Quantified Score

Hi-index 0.04

Abstract

With the vast amount of information available online, reviewing the search results for a given query requires the user to go through many titles and snippets. This search time can be reduced by grouping the results into clusters, so that the user can pick the relevant cluster at a glance from the cluster labels. For web page clustering, terms (features) can be extracted from different parts of the web page. Giansalvatore, Salvatore and Alessandro extracted terms from the entire web page for clustering. The number of terms returned in this case is large, which produces lengthy vectors. To reduce the size of the vector, Stanisław Osiński et al. and Ahmed Sameh and Amar Kadray considered terms from the snippets. In this work, terms are extracted from the URL (Uniform Resource Locator), Title tag and Meta tag to cluster the web documents. These parts of a web page are selected because they contain the keywords of the page. Among existing clustering algorithms, K-means is simple and can be easily implemented for solving many practical problems. Its disadvantage is the random selection of initial centroids; this paper instead selects initial centroids by calculating the midpoint. The proposed method is compared with snippet-based clustering and with traditional K-means clustering on URL and tag contents, in terms of intra-cluster distance and inter-cluster distance.
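
To make the feature-extraction step concrete, the following is a minimal Python sketch of collecting terms from a page's URL, Title tag and Meta tags. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say which meta tags are used or how terms are filtered, so the choice of the `keywords` and `description` meta tags, the small `STOPWORDS` list, the minimum term length, and the names `tokenize` and `extract_terms` are all hypothetical.

```python
import re
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical stop list: URL scaffolding and a few common words.
# The source does not specify any stop-word filtering.
STOPWORDS = {"www", "com", "org", "net", "http", "https", "html", "htm",
             "the", "and", "for", "with"}


class TitleMetaParser(HTMLParser):
    """Collects the <title> text and the content of selected <meta> tags."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            # Assumption: only keywords/description meta tags carry terms.
            if (attr.get("name") or "").lower() in {"keywords", "description"}:
                self.chunks.append(attr.get("content") or "")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.chunks.append(data)


def tokenize(text):
    """Lowercase, split on non-letters, drop short tokens and stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if len(t) > 2 and t not in STOPWORDS]


def extract_terms(url, html):
    """Return term frequencies drawn from the URL, Title tag and Meta tags."""
    parsed = urlparse(url)
    parser = TitleMetaParser()
    parser.feed(html)
    text = " ".join([parsed.netloc, parsed.path] + parser.chunks)
    return Counter(tokenize(text))
```

Because the page body is never tokenized, the resulting term vectors stay short compared with whole-page extraction, which is the motivation the abstract gives for restricting features to these three sources. The term counts could then be turned into vectors for K-means; the midpoint-based centroid selection itself is not sketched here because the abstract does not define it.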