Improving suffix tree clustering with new ranking and similarity measures

  • Authors:
  • Phiradit Worawitphinyo;Xiaoying Gao;Shahida Jabeen

  • Affiliations:
  • School of Engineering and Computer Science, Victoria University of Wellington, Wellington, New Zealand;School of Engineering and Computer Science, Victoria University of Wellington, Wellington, New Zealand;School of Engineering and Computer Science, Victoria University of Wellington, Wellington, New Zealand

  • Venue:
  • ADMA'11 Proceedings of the 7th international conference on Advanced Data Mining and Applications - Volume Part II
  • Year:
  • 2011

Abstract

Retrieving relevant information from the web, which contains an enormous amount of data, is a challenging research problem. An important contribution to this area is web clustering, which efficiently organizes a large number of web documents into a small number of meaningful and coherent groups [1,2]. Various techniques aim to categorize web pages into clusters accurately and automatically. Suffix Tree Clustering (STC) is a phrase-based, state-of-the-art algorithm for web clustering that automatically groups semantically related documents based on shared phrases. Research has shown that it outperforms other clustering algorithms such as K-means and Buckshot because it uses phrases efficiently to identify clusters. Using STC as the baseline, we introduce a new method for ranking base clusters and new similarity measures for comparing clusters. Our STHAC technique combines the Hierarchical Agglomerative Clustering method with phrase-based Suffix Tree Clustering to improve the cluster merging process. Experimental results show that STHAC outperforms both the original STC and ESTC (our previous extended version of STC), with a 16% increase in F-measure. This increase is achieved through better filtering of low-scoring clusters, better similarity measures, and efficient cluster merging algorithms.
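
To make the general idea concrete, the following is a minimal toy sketch of phrase-based base clusters followed by agglomerative merging. It is not the STHAC method: the abstract does not specify the paper's ranking or similarity measures, so this sketch assumes plain Jaccard overlap between document sets, enumerates word n-grams instead of building a suffix tree, and skips base-cluster ranking entirely. The function names (`base_clusters`, `jaccard`, `merge_agglomeratively`) and the `threshold` and `max_len` parameters are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def base_clusters(docs, max_len=3):
    """Map each shared word n-gram (phrase) to the set of documents containing it.

    Assumption: n-gram enumeration stands in for a real suffix tree.
    """
    clusters = {}
    for doc_id, text in enumerate(docs):
        words = text.lower().split()
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                phrase = tuple(words[i:i + n])
                clusters.setdefault(phrase, set()).add(doc_id)
    # A base cluster needs a phrase shared by at least two documents.
    return {p: d for p, d in clusters.items() if len(d) >= 2}


def jaccard(a, b):
    """Overlap of two non-empty document sets (assumed similarity measure)."""
    return len(a & b) / len(a | b)


def merge_agglomeratively(doc_sets, threshold=0.5):
    """Merge the most similar pair repeatedly until no pair reaches threshold."""
    clusters = [set(s) for s in doc_sets]
    while len(clusters) > 1:
        (i, j), sim = max(
            (((i, j), jaccard(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if sim < threshold:
            break
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters


docs = [
    "suffix tree clustering",
    "suffix tree index",
    "web page ranking",
    "web page search",
]
final_clusters = merge_agglomeratively(base_clusters(docs).values())
```

In this toy input the first two documents share the phrase "suffix tree" and the last two share "web page", so the merging step groups them into two separate clusters. A real implementation would replace the assumed Jaccard measure and the fixed threshold with the ranking and similarity measures defined in the paper.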