Improving suffix tree clustering with new ranking and similarity measures

Authors:
Phiradit Worawitphinyo;Xiaoying Gao;Shahida Jabeen
Affiliations:
School of Engineering and Computer Science, Victoria University of Wellington, Wellington, New Zealand;School of Engineering and Computer Science, Victoria University of Wellington, Wellington, New Zealand;School of Engineering and Computer Science, Victoria University of Wellington, Wellington, New Zealand
Venue:
ADMA'11 Proceedings of the 7th international conference on Advanced Data Mining and Applications - Volume Part II
Year:
2011

Citing 27
Cited 0

Term-weighting approaches in automatic text retrieval

Information Processing and Management: an International Journal
Recent trends in hierarchic document clustering: a critical review

Information Processing and Management: an International Journal
Efficient implementation of suffix trees

Software—Practice & Experience
Web document clustering: a feasibility demonstration

Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval
Grouper: a dynamic clustering interface to Web search results

WWW '99 Proceedings of the eighth international conference on World Wide Web
A Space-Economical Suffix Tree Construction Algorithm

Journal of the ACM (JACM)
When information retrieval measures agree about the relative quality of document rankings

Journal of the American Society for Information Science
Machine learning in automated text categorization

ACM Computing Surveys (CSUR)
Introduction to Modern Information Retrieval

Introduction to Modern Information Retrieval
Data Mining: An Overview from a Database Perspective

IEEE Transactions on Knowledge and Data Engineering
Suffix Trees on Words

CPM '96 Proceedings of the 7th Annual Symposium on Combinatorial Pattern Matching
Frequent term-based text clustering

Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Phrase-based Document Similarity Based on an Index Graph Model

ICDM '02 Proceedings of the 2002 IEEE International Conference on Data Mining
Computational dialectology in Irish Gaelic

EACL '95 Proceedings of the seventh conference on European chapter of the Association for Computational Linguistics
Automatic retrieval and clustering of similar words

COLING '98 Proceedings of the 17th international conference on Computational linguistics - Volume 2
Efficient Phrase-Based Document Indexing for Web Document Clustering

IEEE Transactions on Knowledge and Data Engineering
A Concept-Driven Algorithm for Clustering Search Results

IEEE Intelligent Systems
Improving Web Clustering by Cluster Selection

WI '05 Proceedings of the 2005 IEEE/WIC/ACM International Conference on Web Intelligence
A new suffix tree similarity measure for document clustering

Proceedings of the 16th international conference on World Wide Web
Query Directed Web Page Clustering

WI '06 Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence
The Google Similarity Distance

IEEE Transactions on Knowledge and Data Engineering
Search Results Clustering in Chinese Context Based on a New Suffix Tree

CITWORKSHOPS '08 Proceedings of the 2008 IEEE 8th International Conference on Computer and Information Technology Workshops
Linear pattern matching algorithms

SWAT '73 Proceedings of the 14th Annual Symposium on Switching and Automata Theory (swat 1973)
Efficient Phrase-Based Document Similarity for Clustering

IEEE Transactions on Knowledge and Data Engineering
A survey of Web clustering engines

ACM Computing Surveys (CSUR)
Universal Mobile Information Retrieval

UAHCI '09 Proceedings of the 5th International Conference on Universal Access in Human-Computer Interaction. Part II: Intelligent and Ubiquitous Interaction Environments
A New Suffix Tree Similarity Measure and Labeling for Web Search Results Clustering

ICETET '09 Proceedings of the 2009 Second International Conference on Emerging Trends in Engineering & Technology

Abstract

Retrieving relevant information from the web, which contains an enormous amount of data, is a highly complicated research area. A landmark contribution to this area is web clustering, which efficiently organizes a large number of web documents into a small number of meaningful and coherent groups [1,2]. Various techniques aim to categorize web pages into clusters accurately and automatically. Suffix Tree Clustering (STC) is a phrase-based, state-of-the-art algorithm for web clustering that automatically groups semantically related documents based on shared phrases. Research has shown that it outperforms other clustering algorithms such as K-means and Buckshot because it uses phrases efficiently to identify clusters. Using STC as the baseline, we introduce a new method for ranking base clusters and new similarity measures for comparing clusters. Our STHAC technique combines the Hierarchical Agglomerative Clustering method with phrase-based Suffix Tree Clustering to improve the cluster merging process. Experimental results show that STHAC outperforms both the original STC and ESTC (our previous extended version of STC), with a 16% increase in F-measure. This increase is achieved through better filtering of low-scoring clusters, better similarity measures, and efficient cluster merging algorithms.
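
The abstract does not state the new ranking or similarity measures, so the following is only a minimal sketch of the general idea of merging phrase-based base clusters agglomeratively. It assumes each base cluster is a shared phrase mapped to the set of document ids containing it, and it uses Jaccard overlap of document sets as a stand-in similarity. The function names, the Jaccard measure, and the 0.5 threshold are illustrative assumptions, not the measures proposed for STHAC.

```python
from itertools import combinations


def jaccard(a, b):
    """Overlap of two document-id sets (an assumed stand-in similarity)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_base_clusters(base_clusters, threshold=0.5):
    """Greedy agglomerative merging of base clusters.

    base_clusters maps a phrase to the set of document ids sharing it.
    The most similar pair is fused repeatedly until no pair reaches
    the (hypothetical) similarity threshold.
    """
    # Each cluster is (set of phrases, set of document ids).
    clusters = [
        (frozenset([phrase]), frozenset(docs))
        for phrase, docs in base_clusters.items()
    ]
    while len(clusters) > 1:
        # Find the pair of clusters with the highest document overlap.
        best_sim, (i, j) = max(
            (
                (jaccard(clusters[a][1], clusters[b][1]), (a, b))
                for a, b in combinations(range(len(clusters)), 2)
            ),
            key=lambda item: item[0],
        )
        if best_sim < threshold:
            break
        merged = (
            clusters[i][0] | clusters[j][0],
            clusters[i][1] | clusters[j][1],
        )
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters


if __name__ == "__main__":
    example = {
        "suffix tree": {1, 2, 3},
        "tree clustering": {2, 3, 4},
        "web search": {7, 8},
    }
    for phrases, docs in merge_base_clusters(example):
        print(sorted(phrases), sorted(docs))
```

In this toy input the first two base clusters share two of four documents and are fused, while the third shares none and stays separate. A different similarity measure or stopping rule would change which clusters merge.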