Fast document clustering based on weighted comparative advantage

Authors:
Jie Ji;Tony Y. T. Chan;Qiangfu Zhao
Affiliations:
Intelligent System Lab, The University of Aizu, Aizuwakamatsu, Fukushima, Japan;School of Computing, The University of Akureyri, Iceland;Intelligent System Lab, The University of Aizu, Aizuwakamatsu, Fukushima, Japan
Venue:
SMC'09 Proceedings of the 2009 IEEE International Conference on Systems, Man and Cybernetics
Year:
2009

Abstract

Document clustering is the process of partitioning a set of unlabeled documents into clusters such that the documents within each cluster share some common concepts. To help with this analysis, concepts are conveniently represented by key terms. For a clustering algorithm, most of the CPU time is spent in the classification phase. Using words as features, text data are represented in a very high-dimensional vector space. We have previously studied a comparative advantage based algorithm for clustering sparse data in this space; it uses one "ruler" instead of k centers to identify the comparative advantage of each cluster and to define the cluster label for each document. However, that algorithm considers only the relative strength between clusters and ignores the relationship between terms. In this paper, we propose a weighted comparative advantage based clustering algorithm. Experimental results on the SMART system databases show that the new algorithm is better than the simple comparative advantage algorithm, without any extra computation time. Compared with k-means, it not only achieves comparable results but also significantly accelerates the clustering procedure.
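
For orientation, the classification phase that the abstract identifies as the main cost can be sketched for the k-means baseline. This is not the authors' comparative advantage algorithm, whose details the abstract does not give. It is a minimal sketch that assumes sparse documents stored as term-to-weight dictionaries, cosine similarity on unit-length vectors, and hypothetical helper names (`normalize`, `dot`, `assign`, `update`) and toy data chosen only for illustration.

```python
import math


def normalize(vec):
    """Scale a sparse vector (dict: term -> weight) to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else dict(vec)


def dot(a, b):
    """Sparse dot product; iterate over the shorter of the two vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def assign(docs, centers):
    """Classification phase: compare every document with all k centers."""
    labels = []
    for doc in docs:
        sims = [dot(doc, c) for c in centers]
        labels.append(max(range(len(centers)), key=sims.__getitem__))
    return labels


def update(docs, labels, k):
    """Recompute each center as the normalized sum of its documents."""
    sums = [{} for _ in range(k)]
    for doc, label in zip(docs, labels):
        for t, w in doc.items():
            sums[label][t] = sums[label].get(t, 0.0) + w
    return [normalize(s) for s in sums]


# Toy data (assumed for illustration, not taken from the SMART collections).
docs = [normalize(d) for d in [
    {"cluster": 2.0, "document": 1.0},
    {"cluster": 1.0, "term": 1.0},
    {"retrieval": 2.0, "query": 1.0},
    {"retrieval": 1.0, "index": 1.0},
]]
centers = [docs[0], docs[2]]          # seed centers with two documents
labels = assign(docs, centers)        # one classification pass
centers = update(docs, labels, k=2)   # one center update
```

Each pass of `assign` computes one similarity per document per center, so its cost grows with the number of clusters k. That per-center comparison is the work the abstract says a single "ruler" is meant to avoid.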