An optimized k-means algorithm of reducing cluster intra-dissimilarity for document clustering

Authors:
Daling Wang; Ge Yu; Yubin Bao; Meng Zhang
Affiliations:
School of Information Science and Engineering, Northeastern University, Shenyang, P.R. China (all authors)
Venue:
WAIM'05 Proceedings of the 6th international conference on Advances in Web-Age Information Management
Year:
2005

Abstract

Because documents are high-dimensional and sparse, clustering similar documents together is a difficult task. K-Means, the most popular document clustering method, suffers from high intra-cluster dissimilarity, i.e., it tends to group unrelated documents together. One reason is that every object (document) in a cluster has the same influence on the cluster's mean. SOM (Self-Organizing Map) is a method that reduces the dimensionality of data and displays the data in a low-dimensional space, and it has been applied successfully to clustering high-dimensional objects. The scalar factor is an important part of SOM. This paper proposes an optimized K-Means algorithm that introduces the scalar factor from SOM into the means during the K-Means assignment stage to control the influence of new objects on the means. Experiments show that the optimized K-Means algorithm achieves a higher F-Measure and lower Entropy than the standard K-Means algorithm, and thereby reduces the intra-dissimilarity of clusters effectively.
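
The abstract does not give the update formula, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes a SOM-style rule in which the mean of the chosen cluster moves toward each newly assigned document by a scalar factor `alpha` that decays linearly over time. The function name `damped_kmeans`, the parameters `alpha0` and `epochs`, the linear decay schedule, and the use of cosine similarity are all illustrative assumptions.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def damped_kmeans(docs, k, epochs=10, alpha0=0.5, seed=0):
    """K-Means variant whose means move by a decaying scalar factor.

    ASSUMPTION: the rule mean += alpha * (doc - mean) and the linear decay
    of alpha follow the usual SOM learning rule; the abstract does not
    state the exact formula used in the paper.
    """
    rng = random.Random(seed)
    # Initialise the means with k randomly chosen documents.
    means = [list(d) for d in rng.sample(docs, k)]
    labels = [0] * len(docs)
    total_steps = epochs * len(docs)
    step = 0
    for _ in range(epochs):
        for i, doc in enumerate(docs):
            # Assignment stage: pick the most similar mean.
            j = max(range(k), key=lambda c: cosine(doc, means[c]))
            labels[i] = j
            # The scalar factor shrinks linearly from alpha0 toward 0,
            # so later documents influence the mean less.
            alpha = alpha0 * (1.0 - step / total_steps)
            means[j] = [m + alpha * (x - m) for m, x in zip(means[j], doc)]
            step += 1
    return labels, means


if __name__ == "__main__":
    # Toy term-weight vectors: two documents per topic.
    docs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0],
            [0.0, 1.0, 0.0], [0.0, 0.9, 0.1]]
    labels, means = damped_kmeans(docs, k=2)
    print(labels)
```

In this sketch, standard K-Means would correspond to recomputing each mean as the plain average of its members, so every document weighs equally. Here the decaying factor limits how far a single new document can pull the mean, which is the kind of control over influence that the abstract describes.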