A new suffix tree similarity measure for document clustering

Authors:
Hung Chim; Xiaotie Deng
Affiliations:
City University of Hong Kong, Hong Kong; City University of Hong Kong, Hong Kong
Venue:
Proceedings of the 16th international conference on World Wide Web
Year:
2007

Abstract

In this paper, we propose a new similarity measure that computes the pairwise similarity of text documents based on the suffix tree document model. By applying the new suffix tree similarity measure in the Group-average Agglomerative Hierarchical Clustering (GAHC) algorithm, we developed a new suffix tree document clustering algorithm (NSTC). Experimental results on two standard document clustering benchmark corpora, OHSUMED and RCV1, indicate that the new algorithm is very effective. Compared with the traditional word-term tf-idf similarity measure in the same GAHC algorithm, NSTC improved the average F-measure score by 51%. Furthermore, we apply the new clustering algorithm to the analysis of Web documents in online forum communities. A topic-oriented clustering algorithm is developed to help people assess, classify, and search the Web documents in a large forum community.
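
The abstract does not spell out the measure itself, so the following is only a minimal sketch of the general idea, under stated assumptions: every word n-gram up to a fixed length stands in for a suffix tree node, each phrase gets a tf-idf style weight, documents are compared by cosine similarity, and clusters are merged by highest average pairwise similarity (one common reading of group-average linkage). The function names (`phrases`, `tfidf_vectors`, `cosine`, `gahc`), the `max_len` cutoff, and the exact weighting formula are hypothetical choices for illustration, not the authors' NSTC implementation.

```python
import math
from collections import Counter
from itertools import combinations


def phrases(words, max_len=3):
    """All word n-grams up to max_len (assumed stand-in for suffix tree node labels)."""
    return [tuple(words[i:i + n])
            for n in range(1, max_len + 1)
            for i in range(len(words) - n + 1)]


def tfidf_vectors(docs, max_len=3):
    """Map each document to a sparse {phrase: weight} vector (assumed weighting)."""
    counts = [Counter(phrases(d.lower().split(), max_len)) for d in docs]
    df = Counter(p for c in counts for p in c)  # document frequency per phrase
    n = len(docs)
    return [{p: (1 + math.log(tf)) * math.log(1 + n / df[p])
             for p, tf in c.items()}
            for c in counts]


def cosine(u, v):
    """Cosine similarity of two sparse vectors."""
    dot = sum(w * v[p] for p, w in u.items() if p in v)
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


def gahc(docs, k, max_len=3):
    """Group-average agglomerative clustering down to k clusters of doc indices."""
    vecs = tfidf_vectors(docs, max_len)
    sim = {(i, j): cosine(vecs[i], vecs[j])
           for i, j in combinations(range(len(docs)), 2)}
    clusters = [[i] for i in range(len(docs))]

    def avg_sim(a, b):
        # Average similarity over all cross-cluster document pairs.
        total = sum(sim[min(i, j), max(i, j)] for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > k:
        # Merge the pair of clusters with the highest average similarity.
        a, b = max(combinations(range(len(clusters)), 2),
                   key=lambda ab: avg_sim(clusters[ab[0]], clusters[ab[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


docs = [
    "suffix tree clustering of web documents",
    "suffix tree clustering of forum documents",
    "speech recognition with statistical methods",
    "statistical methods for speech recognition",
]
result = gahc(docs, k=2)
```

Because shared multi-word phrases such as "suffix tree clustering" add their own dimensions, documents that share word order score higher than documents that merely share words, which is the intuition behind a phrase-based measure. A real suffix tree would avoid enumerating n-grams explicitly and would not need a length cutoff.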