Optimized k-means clustering with intelligent initial centroid selection for web search using URL and tag contents

Authors:
S. Poomagal;T. Hamsapriya
Affiliations:
Research Scholar, PSG College of Technology, Coimbatore, Tamilnadu, India;PSG College of Technology, Coimbatore, Tamilnadu, India
Venue:
Proceedings of the International Conference on Web Intelligence, Mining and Semantics
Year:
2011

Citing 11
Cited 1

Suffix arrays: a new method for on-line string searches

SIAM Journal on Computing
Accelerating exact k-means algorithms with geometric reasoning

KDD '99 Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining
Grouper: a dynamic clustering interface to Web search results

WWW '99 Proceedings of the eighth international conference on World Wide Web
Multidimensional binary search trees used for associative searching

Communications of the ACM
Statistical Language Learning

Statistical Language Learning
An Efficient k-Means Clustering Algorithm: Analysis and Implementation

IEEE Transactions on Pattern Analysis and Machine Intelligence
MARSYAS: a framework for audio analysis

Organised Sound
A new algorithm for clustering search results

Data & Knowledge Engineering
k-means++: the advantages of careful seeding

SODA '07 Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms
Web People Search via Connection Analysis

IEEE Transactions on Knowledge and Data Engineering
K-means Clustering Algorithm with Improved Initial Center

WKDD '09 Proceedings of the 2009 Second International Workshop on Knowledge Discovery and Data Mining

Sentiment analysis based on clustering: a framework in improving accuracy and recognizing neutral opinions

Applied Intelligence

Quantified Score

Hi-index	0.04

Abstract

With the vast amount of information available online, finding relevant results for a given query requires the user to go through many titles and snippets. This search time can be reduced by grouping search results into clusters, so that the user can select the relevant cluster at a glance by looking at the cluster labels. For web page clustering, terms (features) can be extracted from different parts of the web page. Giansalvatore, Salvatore and Alessandro extracted terms from the entire web page for clustering. This yields a large number of terms and produces lengthy vectors. To reduce the size of the vector, Stanisław Osinski et al. and Ahmed Sameh and Amar Kadray considered only terms from the snippets. In this work, terms are extracted from the URL (Uniform Resource Locator), Title tag and Meta tag to cluster the web documents. These parts of a web page are selected because they contain the keywords of the page. Among existing clustering algorithms, K-means is simple and can be easily implemented for solving many practical problems. Its disadvantage is the random selection of initial centroids; this paper instead selects initial centroids by calculating the midpoint. The proposed method is compared with snippet-based clustering and with traditional K-means clustering based on URL and tag content, in terms of intra-cluster distance and inter-cluster distance.
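
The term-extraction step described in the abstract could be sketched roughly as follows. This is only an illustrative sketch using the Python standard library, not the authors' implementation: the names `extract_terms`, `TitleMetaParser` and `STOP_TERMS`, the choice of the `keywords` and `description` Meta tags, and the small stop list are all assumptions made for the example. The abstract does not specify tokenization or filtering.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical stop list (an assumption): URL boilerplate and a few common words.
STOP_TERMS = {"www", "com", "org", "net", "html", "htm", "php", "index",
              "the", "and", "of", "for", "in", "a"}


class TitleMetaParser(HTMLParser):
    """Collects the <title> text and the content of selected <meta> tags."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            name = (attr.get("name") or "").lower()
            # Assumption: only keywords/description Meta tags carry useful terms.
            if name in ("keywords", "description"):
                self.chunks.append(attr.get("content") or "")

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Body text is ignored; only text inside <title> is kept.
        if self.in_title:
            self.chunks.append(data)


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def extract_terms(url, html):
    """Return terms from the URL, Title tag and Meta tags of one page."""
    parsed = urlparse(url)
    url_terms = tokenize(parsed.netloc + " " + parsed.path)
    parser = TitleMetaParser()
    parser.feed(html)
    tag_terms = tokenize(" ".join(parser.chunks))
    return [t for t in url_terms + tag_terms
            if t not in STOP_TERMS and len(t) > 1]
```

Because the page body is never tokenized, the resulting term lists stay short compared with whole-page extraction, which is the motivation the abstract gives for using these three sources. The terms would then be turned into vectors for K-means, a step not shown here.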