With the vast amount of information available online, reviewing the results of a query requires the user to go through many titles and snippets. This search time can be reduced by grouping the results into clusters, so that the user can pick the relevant cluster at a glance from the cluster labels. For web page clustering, terms (features) can be extracted from different parts of the web page. Giansalvatore, Salvatore and Alessandro extracted terms from the entire web page; this yields a large number of terms and therefore lengthy vectors. To reduce the vector size, Stanisław Osiński et al. and Ahmed Sameh and Amar Kadray considered only terms from the snippets. In this work, terms are extracted from the URL (Uniform Resource Locator), the Title tag and the Meta tag to cluster web documents. These parts of a web page are selected because they contain the page's keywords. Among existing clustering algorithms, K-means is simple and easy to implement for many practical problems. Its disadvantage is the random selection of initial centroids; this paper instead selects the initial centroids by calculating the midpoint. The proposed method is compared with snippet-based clustering and with traditional K-means clustering based on URL and tag content, in terms of intra-cluster distance and inter-cluster distance.
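As a rough illustration of the feature-extraction step described above, the following sketch collects terms from a page's URL, Title tag and Meta tags while ignoring the body text. It is not the authors' implementation: the class and function names (`TitleMetaParser`, `extract_terms`), the choice of the `keywords` and `description` Meta tags, the tokenizer, and the small stopword list are all assumptions made for the example.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical stopword list; a real system would likely use a larger one.
STOPWORDS = {"www", "com", "http", "https", "html", "the", "and", "of", "for", "to", "in"}


class TitleMetaParser(HTMLParser):
    """Collects <title> text and the content of keywords/description <meta> tags."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "meta":
            # Assumption: only these two Meta tags carry useful keywords.
            name = (attrs.get("name") or "").lower()
            if name in ("keywords", "description"):
                self.chunks.append(attrs.get("content") or "")

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Body text is skipped; only text inside <title> is kept.
        if self.in_title:
            self.chunks.append(data)


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def extract_terms(url, html):
    """Return the terms found in the URL, Title tag and selected Meta tags."""
    parsed = urlparse(url)
    parser = TitleMetaParser()
    parser.feed(html)
    text = " ".join([parsed.netloc, parsed.path] + parser.chunks)
    return [t for t in tokenize(text) if len(t) > 1 and t not in STOPWORDS]


if __name__ == "__main__":
    page = (
        "<html><head><title>Jaguar Cars</title>"
        '<meta name="keywords" content="jaguar, luxury cars">'
        "</head><body><p>Long body text that is not used.</p></body></html>"
    )
    print(extract_terms("http://www.example.com/jaguar-cars/index.html", page))
```

The resulting term lists are much shorter than those obtained from the full page text, which is the motivation the abstract gives for restricting extraction to these three parts; turning them into vectors for K-means (for example by term frequency) is left out of this sketch.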