A novel hierarchical document clustering algorithm based on a kNN connection graph

Authors:
Qiaoming Zhu;Junhui Li;Guodong Zhou;Peifeng Li;Peide Qian
Affiliations:
School of Computer Science & Technology, Soochow University, Suzhou, China;School of Computer Science & Technology, Soochow University, Suzhou, China;School of Computer Science & Technology, Soochow University, Suzhou, China;School of Computer Science & Technology, Soochow University, Suzhou, China;School of Computer Science & Technology, Soochow University, Suzhou, China
Venue:
ICCPOL'06 Proceedings of the 21st international conference on Computer Processing of Oriental Languages: beyond the orient: the research challenges ahead
Year:
2006

Citing 8
Cited 0

Recent trends in hierarchic document clustering: a critical review

Information Processing and Management: an International Journal
Bagging predictors

Machine Learning
Web document clustering: a feasibility demonstration

Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval
Data mining: concepts and techniques

Data mining: concepts and techniques
Evaluation of hierarchical clustering algorithms for document datasets

Proceedings of the eleventh international conference on Information and knowledge management
An Efficient k-Means Clustering Algorithm: Analysis and Implementation

IEEE Transactions on Pattern Analysis and Machine Intelligence
Efficient Phrase-Based Document Indexing for Web Document Clustering

IEEE Transactions on Knowledge and Data Engineering
HHMM-based Chinese lexical analyzer ICTCLAS

SIGHAN '03 Proceedings of the second SIGHAN workshop on Chinese language processing - Volume 17

Quantified Score

Hi-index	0.00

Visualization

Abstract

Bottom-up hierarchical document clustering normally merges two most similar clusters in each step iteratively. This paper proposes a novel bottom-up hierarchical document clustering algorithm to merge several pairs of most similar clusters in each step. This is done via a concept of “kNN-connectedness”, which measures the mutual connectedness of clusters in kNNs, and a kNN connection graph, which organizes given clusters into several sets of kNN-connected clusters. In such a graph, a connection between any two clusters only exists in the kNN-connected clusters of the same set. Moreover, a new kNN-based attraction function is proposed to measure the similarity between two clusters and indicates the potential probability of the two clusters being merged. The attraction function only considers the relative distribution of their nearest neighbors between two clusters in a vector space while other criteria, such as the well-known cluster-based cosine similarity function, measures the absolute distance between two clusters. This makes the attraction function effectively apply to the cases where different clusters may have very different distance variation. In each step, a kNN connection graph, consisting of several sets of kNN-connected clusters, is first constructed from the given clusters using a kNN algorithm and the concept of “kNN-connectedness”. For each set of kNN-connected clusters, the attraction degree between any two clusters is calculated and several top connected cluster pairs will be merged. In this way, the iteration number can be largely reduced and the clustering process can be much speeded. Evaluation on a news document corpus shows that the kNN connection graph-based hierarchical document clustering algorithm can achieve better performance than the famous k-means clustering algorithm while reducing the iteration number sharply in comparison with normal hierarchical document clustering.